| Hand gesture recognition, as a new human-computer interaction method, has broad application prospects. Hand pose estimation based on 2D vision is one of the prerequisite tasks of hand gesture recognition, so realizing lightweight hand pose estimation is of great significance. This paper adopts a top-down structure for hand pose estimation: the overall framework is divided into two stages, hand position detection and hand keypoint extraction, and the whole pipeline is then designed and optimized in detail, mainly as follows: (1) In the first stage of the top-down framework, a vision-based object detection model is used, and the computing units of ShuffleNet are adjusted to serve as the backbone network of the YOLO model. This lightweight design keeps detection performance close to that of the original algorithm while significantly reducing the consumption of computing and storage resources. In addition, the loss function is redesigned to address the imbalance between positive and negative samples, and a confidence loss based on Focal Loss is ultimately used to improve detection performance. To compensate for the error caused by the inference delay of the hand detection model and by irregular hand motion after the camera captures the image, an improved Kalman filter is used: a differential signal is added to the process noise covariance so that, when the hand moves irregularly, the Kalman gain quickly rebalances the confidence placed in the observed value and the predicted value, and an adaptive hyperparameter adjustment step is added to ensure the robustness of the algorithm. (2) To ensure the adaptability and device friendliness of the algorithm, a regression-based approach is used to implement the keypoint detection network in the top-down pose estimation. From the relationship between the loss
function and the underlying output distribution, this paper analyzes why the regression model performs worse than the heatmap-based model, and uses a generative model to fit the underlying output distribution so that the loss function supervising training is adjusted continuously. The cascaded computation is re-parameterized and optimized into an end-to-end inference process. To prevent an incorrect distribution in the early stage of training from steering the regression model in the wrong direction, the two are decoupled so that distribution fitting is learned incrementally and the regression model at least starts learning from a relatively correct initial distribution. A plug-and-play pose smoothing filter that models position, velocity, and acceleration simultaneously is used to improve real-time performance. A test dataset was built from public datasets and from data collected and labeled by the authors, and the proposed position detection model and keypoint regression model were evaluated on several test metrics. The comparative results show that both proposed methods achieve high performance while remaining lightweight. |
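The abstract does not give the exact form of the Focal Loss based confidence loss, so the following is only a minimal sketch of the standard binary focal loss applied to a single objectness score. The function names and the values `alpha=0.25` and `gamma=2.0` are illustrative assumptions (commonly used defaults), not settings taken from the paper.

```python
import math

def focal_confidence_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one predicted confidence p in (0, 1) and label y in {0, 1}.

    alpha and gamma are assumed, illustrative values.
    """
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def mean_focal_loss(preds, labels, **kwargs):
    """Average focal loss over a batch of confidences and labels."""
    losses = [focal_confidence_loss(p, y, **kwargs) for p, y in zip(preds, labels)]
    return sum(losses) / len(losses)
```

Because easy negatives (background cells predicted with low confidence) contribute almost nothing, the many negative samples no longer dominate the gradient, which is the imbalance the abstract refers to.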
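The idea of adding a differential signal to the process noise covariance can be illustrated with a scalar Kalman filter. This is a simplified sketch under stated assumptions, not the paper's algorithm: the state is a single coordinate with a random-walk motion model, the differential signal is assumed to be the frame-to-frame change in the observation, and `q0`, `k`, and `r` are hypothetical hyperparameters. The paper's adaptive hyperparameter adjustment is not reproduced here.

```python
def adaptive_kalman_1d(zs, q0=1e-3, k=0.5, r=1.0):
    """Scalar Kalman filter whose process noise grows with the observation difference.

    zs: sequence of noisy observations of one coordinate.
    q0: baseline process noise; k: weight of the differential term; r: observation noise.
    All three are assumed, illustrative values.
    """
    x, p = zs[0], 1.0          # initial state estimate and its variance
    prev_z = zs[0]
    estimates, gains = [], []
    for z in zs:
        d = z - prev_z                  # differential signal (assumed definition)
        q = q0 + k * d * d              # inflate process noise on abrupt motion
        p_pred = p + q                  # predict step (random-walk model)
        gain = p_pred / (p_pred + r)    # Kalman gain: weight given to the observation
        x = x + gain * (z - x)          # update step
        p = (1.0 - gain) * p_pred
        prev_z = z
        estimates.append(x)
        gains.append(gain)
    return estimates, gains
```

For example, with `zs = [0.0] * 20 + [10.0] * 5` the gain is small while the signal is steady and jumps toward 1 at the step change, so the estimate follows the sudden motion instead of lagging behind it.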