3D human pose and shape estimation is a challenging and widely studied problem in computer vision. While promising results have been achieved, the field still faces difficulties such as self-occlusion and the ambiguity that arises when lifting 2D poses from single-view images to 3D. Applying graph convolutional networks to feature extraction is an effective way to address these problems: a graph convolutional network can improve estimation accuracy by aggregating edge information among key nodes and capturing the correlations between nodes. Multi-view images and video data provide human depth information, more comprehensive appearance information, and occlusion cues, which help resolve problems such as occlusion and pose ambiguity. The main contributions of this thesis are summarised below:

(1) To address the occlusion problem in current monocular 3D human pose and shape estimation methods, this thesis proposes a new method that effectively learns the relationships between adjacent vertices of a template mesh by introducing a 3D human mesh template into a graph convolutional network. To achieve effective interaction between local and global information and to extract more informative spatial and channel features, a polarised cross-fusion Transformer encoder is proposed. The encoder's ability to handle occluded regions is further improved through occluded-vertex modelling, and the shape and pose parameters of the SMPL model are predicted using a hierarchical Matrix-Fisher distribution. Experiments show that introducing graph convolution and polarised cross-fusion attention enhances the Transformer's ability to model both local and global features.

(2) To address occlusion, pose ambiguity and depth ambiguity in monocular 3D human pose and shape estimation, this thesis proposes a novel method based on multi-view fusion and a Mix-Graphormer encoder. The method fuses images from different views to obtain more comprehensive depth, appearance and occlusion information, and establishes point correspondences between views using epipolar geometry. The Mix-Graphormer encoder extracts features from the target human template and the multi-view fusion heatmaps, and integrates the target information to obtain more comprehensive global and local structural information. Experiments show that Mix-Graphormer effectively models local and global structural features and improves prediction accuracy.

(3) Since video data contains temporal information, multi-view information and richer pose data, this thesis proposes a 3D human pose and shape estimation model based on video and a Mix Spatial-Temporal Transformer, which builds comprehensive depth and temporal information from preceding and following frames. The model consists of a temporal encoder, a spatial encoder and a mixed spatiotemporal Transformer, and a random masking strategy is used to better model spatial dependencies. The Mix Spatial-Temporal Transformer fuses and interacts the temporal and spatial information to capture the associations between spatial locations and temporal context. Experiments show that introducing the hybrid attention mechanism enhances the Transformer's ability to fuse and interact temporal and spatial information.
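The cross-view point correspondence in (2) rests on the standard epipolar constraint: a point in one view must lie on the epipolar line induced by its match in the other view. The following is a minimal illustrative sketch of that constraint, not the thesis implementation; the function names and the rectified-stereo fundamental matrix in the usage example are assumptions for demonstration only.

```python
import numpy as np

def epipolar_line(F, x):
    """Map a 2D point x = (u, v) in view 1 to its epipolar line in view 2.

    F is the 3x3 fundamental matrix between the views; the line l' = F @ x_h
    is returned as coefficients (a, b, c) of a*u + b*v + c = 0, normalised
    so that a^2 + b^2 = 1 (making point-line distances metric in pixels)."""
    x_h = np.array([x[0], x[1], 1.0])      # homogeneous coordinates
    l = F @ x_h
    return l / np.linalg.norm(l[:2])

def point_line_distance(l, x):
    """Perpendicular distance of point x from a normalised line l.

    A detection in view 2 that truly corresponds to the view-1 point
    should have (near-)zero distance from the epipolar line."""
    return abs(l[0] * x[0] + l[1] * x[1] + l[2])

# Hypothetical usage with a rectified stereo pair, for which
# F = [[0,0,0],[0,0,-1],[0,1,0]] maps a point to the horizontal
# line through the same image row in the other view.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
line = epipolar_line(F, (100.0, 50.0))
same_row = point_line_distance(line, (200.0, 50.0))   # ~0: consistent match
off_row = point_line_distance(line, (200.0, 60.0))    # large: inconsistent
```

In a multi-view fusion pipeline, candidate joint detections in a second view can be scored by this distance and the closest one taken as the correspondence, which is how epipolar geometry disambiguates matches across views.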