| 3D human pose estimation is the task of recovering the 3D pose of the human body from images, and it has a wide range of applications in action recognition, human-computer interaction, and sports. This thesis focuses on 3D human pose estimation in natural scenes, where pose diversity, joint occlusion, and inherent ambiguity make the task more challenging. To address these problems, this thesis exploits image information, data associations, and model characteristics to improve accuracy and practicality. The specific contributions of this thesis are as follows: (1) A method for 3D human pose estimation in natural scenes that incorporates relative depth is proposed. The relative depths of adjacent joints are used as weak labels to supervise image feature generation, so that the image features capture 3D information and contain less redundancy. First, the 2D pose estimation network is modified to obtain multi-scale features; then, heatmaps and a pooling operation are used to obtain one-dimensional image feature vectors; finally, a Transformer network predicts the 3D pose. The loss function combines supervised learning on labeled poses, weakly supervised learning on relative depth, and unsupervised learning on unlabeled data, which allows end-to-end training and inference. The method achieves an MPJPE of 49.0 mm on the Human3.6M dataset, and its effectiveness in natural scenes is demonstrated by quantitative results on the MPI-INF-3DHP dataset and visualization results on the MPII dataset. (2) A 3D human pose estimation method that combines multiple spatial information fusion strategies is proposed. This thesis shows through theoretical analysis that, in 3D human pose estimation, the three information fusion strategies of graph convolution, linear layers, and self-attention belong to the same family of methods. The three strategies are used together to accommodate variations in pose. To ensure that information is associated and fused sufficiently, a spatial attention mechanism fuses the three outputs and a channel attention mechanism performs channel fusion. The method achieves a slightly lower MPJPE than the comparison method on the Human3.6M dataset, and the effectiveness of each module in natural scenes is verified on two challenging datasets, MPI-INF-3DHP and MPII. (3) A multi-hypothesis 3D human pose estimation method based on the Transformer and the diffusion model is proposed to address the difficulty that single-hypothesis outputs have with occlusion and ambiguity. First, weighted random sampling is used to discretize the distribution of joint positions in the heatmap, and a Transformer generates a one-dimensional encoding of that distribution. A Transformer then serves as the backbone network of the diffusion model, and the distribution encoding is embedded as the condition for multi-hypothesis generation. The result is a unified Transformer architecture that simplifies the design and can be implemented efficiently. Because inference with the diffusion model is slow, DDIM and DPM-Solver are used to accelerate inference and improve practicality. This method outperforms Wehrbein's normalizing-flow-based method and approaches the best results among methods of the same category, which reflects the effectiveness of each module design. |
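
The weighted random sampling step in contribution (3) can be sketched roughly as follows. This is a minimal illustration rather than the thesis's implementation: the function name `sample_joint_locations`, the nested-list heatmap layout, the sample count, and the seed are all assumptions made here, and the subsequent Transformer encoding of the samples is omitted.

```python
import random


def sample_joint_locations(heatmap, num_samples, seed=None):
    """Draw (x, y) pixel coordinates with probability proportional to heatmap values.

    `heatmap` is assumed to be a non-empty list of equal-length rows of
    non-negative floats for a single joint (a hypothetical layout).
    """
    rng = random.Random(seed)
    height = len(heatmap)
    width = len(heatmap[0])

    # Flatten the grid so each pixel becomes one candidate with one weight.
    coords = [(x, y) for y in range(height) for x in range(width)]
    weights = [max(heatmap[y][x], 0.0) for (x, y) in coords]
    if sum(weights) <= 0.0:
        raise ValueError("heatmap must contain at least one positive value")

    # Sampling with replacement turns the continuous heatmap into a
    # discrete set of candidate joint positions.
    return rng.choices(coords, weights=weights, k=num_samples)


# Toy 3x3 heatmap peaked at the centre pixel (illustrative values only).
heatmap = [
    [0.0, 0.1, 0.0],
    [0.1, 0.6, 0.1],
    [0.0, 0.1, 0.0],
]
samples = sample_joint_locations(heatmap, num_samples=5, seed=0)
```

Pixels with zero weight are never drawn, so the samples concentrate around the heatmap peak while still reflecting uncertainty when the peak is broad or multi-modal, which is the situation occlusion and ambiguity tend to produce.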