| Human pose estimation is a fundamental computer vision technique with applications in action and activity recognition, action detection, human tracking, human-computer interaction, video surveillance, film and animation, virtual reality, medical assistance, and sports motion analysis. Over the past decade, deep learning has driven substantial progress in human pose estimation. However, variation in human outlines, overlapping limbs, occluded keypoints, and crowded scenes remain the main causes of incorrect detection or ambiguous classification of joints. Deeper convolutional neural networks (CNNs) and high-resolution representations are commonly used to produce discriminative contextual information and suppress these systematic errors, but they require substantial computing resources. Human pose estimation therefore needs methods that are both efficient and accurate. To address these problems, this thesis analyzes in depth the feature extraction and fusion methods used in existing pose estimation networks and proposes a lightweight framework. The main contributions of this thesis are as follows. First, this thesis revisits cascaded dilated convolution and argues that it is worth using in human pose estimation to obtain multi-scale features at the same spatial size. After analyzing the loss of local information in the cascade framework, it proposes the cascaded residual dilated convolution (CRDC), which strengthens the information flow so that precise location context is retained. As a plug-and-play module, the CRDC uses a group of small dilation rates to capture multi-scale contextual information and mix features for predicting human keypoints at very low computational cost. Second, to enable efficient deployment of network models under limited computing resources, this thesis proposes a lightweight unified framework, the Bilateral Pose Architecture (Bi Pose). One branch of the architecture extracts low-level spatial location information at low computational cost, while the other 
branch uses a lightweight backbone network to generate high-level semantic information. A dedicated fusion module is designed to combine the features from the two branches. The architecture increases inference speed while preserving the model's predictive accuracy as far as possible. Finally, lightweight networks based on ordinary convolution, separable convolution, dilated separable convolution, and micro-factorized convolution are designed to meet different hardware and computing resource constraints. In addition, a complex activation function is introduced to offset the performance degradation caused by reducing parameters and computation, and a pose correction loss is proposed to supervise the network in recognizing poses and correcting erroneous, implausible ones. |
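
The idea of cascading dilated convolutions with small dilation rates while keeping the spatial size fixed can be illustrated with a toy one-dimensional sketch. This is not the CRDC module itself: the residual form used here (each stage adds its input back to its output), the dilation rates (1, 2, 3), the all-ones kernel, and the function names are all assumptions chosen for illustration. The sketch only shows that each stage sees a wider context than the previous one while the output length never changes, and that the residual path keeps the original local signal in every stage.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Zero-padded 'same' 1D dilated convolution (cross-correlation form)."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + (j - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):         # positions outside count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def cascaded_residual_dilated(signal: Sequence[float], kernel: Sequence[float],
                              rates: Sequence[int]) -> List[List[float]]:
    """Hypothetical cascade: x <- x + dilated_conv(x) for each rate in turn.

    Returns the feature after every stage, so the stages can be treated as
    multi-scale features that all share the same spatial size.
    """
    x = list(signal)
    stages = []
    for rate in rates:
        y = dilated_conv1d(x, kernel, rate)
        x = [a + b for a, b in zip(x, y)]  # residual path keeps local detail
        stages.append(x)
    return stages


def receptive_field(kernel_size: int, rates: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 dilated convolutions."""
    field = 1
    for rate in rates:
        field += (kernel_size - 1) * rate
    return field


if __name__ == "__main__":
    impulse = [0.0] * 15
    impulse[7] = 1.0  # a single active position in the middle
    stages = cascaded_residual_dilated(impulse, [1.0, 1.0, 1.0], (1, 2, 3))
    for rate, feature in zip((1, 2, 3), stages):
        width = sum(1 for value in feature if value != 0)
        print(f"rate {rate}: length {len(feature)}, context width {width}")
```

With a kernel of size 3, the context width after the three stages grows to 3, 7, and 13 positions, matching the receptive-field formula 1 + Σ (k − 1)·d, while every stage keeps the input length of 15. In a real network the same behaviour would come from padded 2D dilated convolutions with learned weights.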