
Multi-person 3D Pose Estimation Based On Monocular Vision

Posted on: 2024-09-04  Degree: Master  Type: Thesis
Country: China  Candidate: R C Wang  Full Text: PDF
GTID: 2568307151453624  Subject: Computer technology
Abstract/Summary:
Monocular human pose estimation is a challenging and important task in computer vision, and an indispensable part of machine understanding of human movement and behavior. Human pose estimation can be divided into two sub-tasks: single-person pose estimation and multi-person pose estimation. With the development of computer hardware and deep learning, deep neural networks have spread into many areas of computer vision research and have driven rapid progress in multi-person pose estimation. This thesis implements 3D human pose estimation based on deep learning. First, a pose optimizer refines an existing 2D pose estimate to obtain more accurate 2D keypoints. Then, accurate 3D keypoint coordinates are predicted by an enhancement network. Finally, absolute depth is estimated from the position of the human body. Current 3D pose estimation still faces several problems, including the accuracy of 2D pose estimation, occlusion between people, and absolute depth estimation in multi-person scenes. Building on classical network models for human pose estimation, this thesis studies these problems as follows:

(1) Pose optimizer. In multi-person 2D pose estimation, neither top-down nor bottom-up methods can solve every problem at once; close interaction between people and changes in human scale in particular degrade the accuracy of 2D pose estimation. This thesis therefore proposes the Pose Refine Machine Dual Attention (PRMDA). The pose optimizer uses the branch structure of a dual attention network to balance the local and global representations of the output features and to further refine the keypoints. By changing the size of the convolutional blocks to reduce the number of parameters and by incorporating SENet, the improved lightweight feature extraction network balances accuracy gains against parameter reduction. Tested on the COCO and MPII datasets, the method improves on the baseline network, with AP reaching 72.5% and 89.6%, respectively. Compared with the baseline network on the COCO dataset, the computational cost is reduced by 24%, while AP increases by 0.9%, AP.5 by 1.2%, and AR by 1.9%.

(2) Relative depth estimation. Because joints may be occluded in an image, recovering 3D human pose from a single image is limited. The High Resolution Temporal Convolutional Network (HRTCN) proposed in this thesis is based on temporal convolution and exploits the temporal information between preceding and following video frames. A multi-stage residual structure and multi-channel fusion enlarge the effective receptive field so that the model captures richer multi-scale information. In addition, a nonlinear convolution layer and a channel attention module are added at the end of the network to improve accuracy. HRTCN is validated on the Human3.6M and HumanEva-I datasets, where MPJPE reaches 45.4 mm and 14.7 mm, respectively, a reduction of 1.4 mm and 1.1 mm compared with the baseline networks.

(3) Absolute depth estimation. Many current approaches to root joint depth estimation depend on the size of the bounding box, which introduces errors into the depth data. This thesis proposes RootlocNet, a root joint localization network based on a bounding box constraint. The ResNet in the baseline network is replaced with ConvNeXt, and a bounding box constraint is added that uses the aspect ratio of the box to determine whether the pose is stretched or curled. A correction factor γ is then generated to adjust the area of the bounding box. This reduces the influence of bounding box size on depth, so that more accurate depth information can be obtained. RootlocNet reaches an accuracy of 84.5% on the MuPoTS-3D dataset, and on the Human3.6M dataset its MRPE is 35.2 mm lower than that of the baseline RootNet.
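The abstract reports results in MPJPE and MRPE without defining them. As a rough illustration only, the sketch below assumes the commonly used definitions: MPJPE as the mean Euclidean distance between predicted and ground-truth joints (often after subtracting the root joint), and MRPE as the mean Euclidean distance between predicted and ground-truth root positions. The function names, the `root_index` parameter, and the sample coordinates are hypothetical and are not taken from the thesis.

```python
import math

def root_relative(joints, root_index=0):
    """Express every joint relative to the root joint (assumed to be at root_index)."""
    root = joints[root_index]
    return [tuple(c - r for c, r in zip(joint, root)) for joint in joints]

def mpjpe(pred, gt):
    """Mean Euclidean distance between matching 3D points, in the input unit (e.g. mm)."""
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def mrpe(pred_roots, gt_roots):
    """Mean root position error: the same averaging, applied to one root per person."""
    return mpjpe(pred_roots, gt_roots)

# Illustrative values only (millimetres), not data from the thesis.
pred_pose = root_relative([(1.0, 1.0, 1.0), (4.0, 5.0, 1.0)])
gt_pose = root_relative([(2.0, 2.0, 2.0), (2.0, 2.0, 2.0)])
pose_error = mpjpe(pred_pose, gt_pose)
root_error = mrpe([(0.0, 0.0, 1000.0)], [(0.0, 0.0, 1030.0)])
```

Under these assumptions, root-relative MPJPE measures the quality of the relative pose (contribution 2), while MRPE isolates the error of the absolute root location (contribution 3), which is why the two are reported separately.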
Keywords/Search Tags: Multi-person 3D pose estimation, Convolutional neural network, Attention mechanism, ResNet of ResNet, Temporal Convolutional Network, Depth estimation