| With the development of image and video acquisition technology, the volume of image and video data has grown explosively. Human beings are the main subjects in these data, and detecting human bodies to understand and analyze their behavior has become a critical issue in computer vision. Since keypoint information directly reflects human body posture, human pose estimation based on keypoint detection has become a research hotspot in the field. Because of its high detection accuracy, this thesis focuses on the top-down pose estimation approach. To reduce the high computational cost of top-down pose estimation, this thesis designs a lightweight pose estimation network called Light Net that integrates the parallel connection structure of HRNet. In addition, top-down pose estimation algorithms make little use of the correlation between keypoints. To strengthen the ability to capture long-range relationships, this thesis optimizes the design of the top-down pose estimation network using the self-attention mechanism. The contributions of this thesis are as follows: 1. A lightweight human pose estimation network, Light Net, is designed. Combining the structural advantages of the classical pose estimation algorithm HRNet, this thesis builds on HRNet's parallel connection structure and proposes a lightweight feature extraction module and a multi-scale feature perception fusion module. Experimental results show that Light Net-s has 2.9M parameters and an inference computation cost of 1.5G, which are 10.2% and 9.4% of those of HRNet, respectively. With this extremely lightweight model, Light Net-s reaches a mean average precision of 62.9%, losing only 17% of performance compared to HRNet. Furthermore, experimental results show that Light Net-s, with fewer parameters, outperforms other lightweight networks. In
summary, Light Net achieves a good balance between detection accuracy and computational cost. It greatly reduces computational cost while maintaining performance, and significantly improves the deployability of top-down human pose estimation networks. 2. The performance of the human pose estimation network is optimized with the self-attention mechanism. Capturing long-distance dependencies is important for human pose estimation. Because convolutional neural networks focus on local information and have limited ability to capture long-distance relationships, this thesis proposes the Attention Network (Att Net). Att Net applies the self-attention mechanism to pose estimation to better capture the dependencies between keypoints and improve network performance. Att Net-s T4, with Light Net-s as the backbone network, achieves a mean average precision of 66.6% with only 1.9M parameters. Att Net is better than many lightweight networks in both model size and network performance. Finally, by visualizing the self-attention heatmaps and detection results, this thesis shows the contribution of the self-attention mechanism to capturing the correlation between keypoints, and indicates that the self-attention mechanism has broad application prospects in human pose estimation. |
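
The abstract does not specify how Att Net implements self-attention, so the following is only a minimal sketch of generic single-head scaled dot-product self-attention, included to illustrate why the mechanism can relate distant tokens in one step. The function names, the toy `tokens`, and the identity projection matrices are all hypothetical and are not taken from the thesis.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(tokens, w_q, w_k, w_v):
    """Generic single-head scaled dot-product self-attention (illustrative only).

    tokens: (n x d) feature vectors, e.g. one per image location or keypoint.
    w_q, w_k, w_v: (d x d_k) projection matrices (hypothetical values here).
    Returns (outputs, weights); weights[i][j] is how strongly token i
    attends to token j, regardless of how far apart they are.
    """
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                       # pairwise similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input (assumption): tokens 0 and 2 share the same feature vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]               # placeholder projections
outputs, weights = self_attention(tokens, identity, identity, identity)
```

In this toy setup, token 0 attends to token 2 as strongly as to itself and more strongly than to token 1, even though token 2 is not adjacent to it. A convolution with a small kernel would need several stacked layers to connect such positions, which is the limitation the abstract attributes to convolutional networks.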