Three-dimensional hand pose estimation refers to recovering the positional relationships among the key points of the human hand from image, depth, motion-capture, or other sensor data, using computer graphics, computer vision, neural networks, or other algorithms, and thereby reconstructing three-dimensional hand pose information such as hand position and skeletal constraints. 3D hand pose estimation can be applied in many fields, such as film and animation, remote control, and virtual meetings. Meanwhile, with the development of virtual reality, more and more researchers have recognized the advantages of virtual reality technology in education. In recent years, virtual reality has been used in teaching experiments with increasing frequency, and virtual experiments have been widely applied in middle schools to help students memorize and understand experimental operations. However, current virtual experiments are still operated with traditional mice or joysticks rather than natural interaction methods (such as direct two-handed interaction). Such virtual experiments only let students experience the experimental process; they cannot cultivate students' operational skills or awareness of experimental procedure. To improve this, better virtual-real fusion technology needs to be introduced into middle school virtual experiments. In particular, hand interaction is the most important mode of interaction in most middle school experiments, so 3D hand pose estimation needs to be applied to middle school virtual experiments. In the middle school laboratory setting, however, low cost, easy scalability, and minimal interference with the user must be considered in addition to tracking accuracy and real-time performance. Existing commercial 3D hand pose estimation systems often rely on expensive external equipment and require users to attach markers to their hands or wear gloves; existing research prototypes often either neglect real-time performance or rely on high-performance graphics hardware to achieve it, and therefore fail to meet the requirements of low cost and easy scalability.

In response to these requirements, this paper designs and proposes a three-dimensional hand pose estimation system. The system places no additional equipment on the user's hand and requires only an ordinary RGB color camera to capture hand images in real time, from which the three-dimensional hand pose is recovered. Specifically, the research contributions comprise three parts. First, a hand bounding box extraction model is designed based on convolutional neural networks, which extracts the position and size of the hand bounding box from a single RGB frame. The model combines a deep sequential network with a multi-scale architecture and introduces anchors to improve training, so that it can track hand regions of different sizes over a wide area. Second, a two-dimensional hand key point extraction model is designed based on convolutional neural networks, which extracts heat maps of the hand's two-dimensional key points from the hand bounding box. The model follows the architecture of the convolutional pose machine and uses convolutional layer splitting and knowledge distillation to improve the expressive ability of the final model at both the model and training levels.
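As an illustration of the two-dimensional key point model described above, the following is a minimal sketch (not the paper's actual implementation) of a convolutional-pose-machine-style network that uses split (depthwise-separable) convolutions. It assumes PyTorch, 21 hand key points, and two stages; the class and parameter names are hypothetical.

```python
# Minimal sketch (assumptions: PyTorch, 21 hand key points, 2 stages).
# Each stage predicts per-keypoint heat maps; later stages refine them by
# re-using the shared image features together with the previous stage's
# heat maps. SeparableConv stands in for the "convolutional layer splitting" idea.
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """3x3 depthwise + 1x1 pointwise convolution (one way to split a conv layer)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return torch.relu(self.pointwise(self.depthwise(x)))

class TinyPoseMachine(nn.Module):
    def __init__(self, num_keypoints=21, feat_ch=64, stages=2):
        super().__init__()
        # Shared feature extractor applied once to the cropped hand image.
        self.backbone = nn.Sequential(
            SeparableConv(3, feat_ch), nn.MaxPool2d(2),
            SeparableConv(feat_ch, feat_ch), nn.MaxPool2d(2),
        )
        # Stage 1 maps the features to an initial set of heat maps.
        self.stage1 = nn.Conv2d(feat_ch, num_keypoints, 1)
        # Later stages see features + previous heat maps and output refined heat maps.
        self.refine = nn.ModuleList(
            nn.Sequential(
                SeparableConv(feat_ch + num_keypoints, feat_ch),
                nn.Conv2d(feat_ch, num_keypoints, 1),
            )
            for _ in range(stages - 1)
        )

    def forward(self, x):
        feats = self.backbone(x)
        heatmaps = [self.stage1(feats)]
        for stage in self.refine:
            heatmaps.append(stage(torch.cat([feats, heatmaps[-1]], dim=1)))
        return heatmaps  # intermediate supervision: a loss can be applied per stage

if __name__ == "__main__":
    model = TinyPoseMachine()
    maps = model(torch.randn(1, 3, 128, 128))
    print([m.shape for m in maps])  # one [1, 21, 32, 32] tensor per stage
```

In a distillation setting, the heat maps of a larger teacher network would additionally supervise each stage's output; that training-level component is omitted here for brevity.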
Third, a three-dimensional hand pose calculation module is designed. This module takes the two-dimensional key point coordinates, combines them with the hand position and physiological constraints, and solves for the three-dimensional hand pose. Using the properties of the Gaussian distribution, it first corrects errors in the two-dimensional key point coordinates, then performs a discrete, exhaustive search over distance and angle relationships to compute the three-dimensional coordinates of the hand key points, and finally corrects the displacement using hand model constraints.

Compared with traditional methods, every module of the final system shows improved performance and recognition quality. The prediction accuracy of the bounding box extraction module improves by about 4.23 percentage points over the traditional sequential network model and handles multi-scale input images better; the prediction accuracy of the two-dimensional key point extraction module improves by about 3.33 percentage points and its running time by about 9.638 ms compared with the traditional convolutional pose machine; the processing time of the whole system is about 37.814 ms per frame, meeting the performance requirement of 24 frames per second. The system handles common hand postures in virtual experiments (such as gripping and pinching), occlusion and self-occlusion, and hands at long range well.
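To make the discrete, exhaustive three-dimensional calculation concrete, the following is a small numerical sketch (an illustration under assumed conditions, not the system's actual code). It assumes a pinhole camera with known intrinsics, a known wrist (root) depth, and bone lengths taken from a hand model; a child joint is recovered by trying candidate depths and keeping the one whose back-projected 3D point best matches the constrained bone length.

```python
# Minimal numerical sketch (assumptions: pinhole camera with known intrinsics
# fx, fy, cx, cy; known wrist depth; bone lengths from a hand model).
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Lift a 2D pixel coordinate to a 3D camera-space point at a given depth."""
    u, v = uv
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def lift_joint(uv_child, parent_xyz, bone_len, fx, fy, cx, cy,
               depth_range=(0.2, 1.2), steps=200):
    """Exhaustively search candidate depths for the child joint and keep the
    depth whose distance to the parent joint best matches the bone length."""
    candidates = np.linspace(*depth_range, steps)
    best_xyz, best_err = None, np.inf
    for d in candidates:
        xyz = backproject(uv_child, d, fx, fy, cx, cy)
        err = abs(np.linalg.norm(xyz - parent_xyz) - bone_len)
        if err < best_err:
            best_xyz, best_err = xyz, err
    return best_xyz

if __name__ == "__main__":
    fx = fy = 600.0; cx = cy = 320.0                       # hypothetical intrinsics
    wrist = backproject((300, 310), 0.5, fx, fy, cx, cy)   # root joint at known depth
    knuckle = lift_joint((320, 280), wrist, bone_len=0.09, # 9 cm bone, illustrative
                         fx=fx, fy=fy, cx=cx, cy=cy)
    print(np.round(knuckle, 3))
```

In the actual module, the angle relationships and hand model constraints mentioned above would further restrict the candidate set and resolve the front/back depth ambiguity that a pure distance criterion leaves open.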