Research On 3D Human Pose Estimation Based On Monocular Video

Posted on:2023-01-02

Degree:Doctor

Type:Dissertation

Country:China

Candidate:F Y Xu

Full Text:PDF

GTID:1528307136999249

Subject:Information security

Abstract/Summary:

3D human pose estimation aims to predict the positions of human body joints in 3D space from images or videos. It is an active research topic in computer vision and is widely used in fields such as information security, human motion analysis and human-computer interaction. Within 3D human pose estimation, estimating an accurate 3D pose from monocular images or video is the most challenging and most promising research direction. Compared with approaches that rely on depth cameras or multi-view cameras, monocular 3D human pose estimation is easier to deploy in real-world scenarios. Over the past decade, the development of deep learning has rapidly improved the performance of monocular 3D human pose estimation methods. Nevertheless, research in this area still faces three major challenges. The first is the inherent depth ambiguity of the 2D-to-3D pose mapping: a single 2D joint position can correspond to multiple 3D joint positions. The second is a lack of data, in particular datasets of special poses and outdoor datasets. The third is occlusion, which includes self-occlusion and person-object occlusion in single-person pose estimation and severe person-person occlusion in multi-person pose estimation.

To address these three problems, this dissertation proposes an improved graph convolutional network framework that mines the motion correlation between joints to reduce 2D-to-3D depth ambiguity; builds a multi-view camera system to establish a 3D human pose dataset in traffic scenes for real-world application; and establishes a two-person strong-interaction human pose dataset together with an attention-based spatial interaction information integration network to alleviate occlusion in multi-person 3D pose estimation. The main research contents are as follows:

(1) To address the depth ambiguity that arises when mapping a 2D pose to a 3D pose in monocular single-person 3D pose estimation, a spatio-temporal graph convolutional network framework is proposed. It takes a continuous 2D pose sequence as input, models the spatial information of the input with a graph convolutional network, and models the temporal information with dilated convolution. In the graph convolution module, the motion state of each joint is learned from mainstream datasets and the joints are classified according to that motion state. Based on this classification, two adjacency matrices are designed: one that represents the motion correlation between joints in general motion, and one that expresses the second-order neighbor connections of joints undergoing severe motion. Experimental results show that the designed adjacency matrices guide the model in mining the latent motion relationships between joints, which reduces depth ambiguity and yields more accurate estimates.

(2) Building on work (1), and again targeting the depth ambiguity of the 2D-to-3D mapping in monocular single-person 3D pose estimation, a method is proposed for dynamically learning the motion relationships between joints from the input data. Using the spatial position and motion speed of each joint in the input, the method designs a dynamic spatio-temporal graph convolution to model the motion relationships between joints and a dynamic spatial second-order connection graph convolution to model the second-order neighbor connections of joints undergoing severe motion. By fusing the fixed graph convolution, the dynamic spatio-temporal graph convolution and the dynamic spatial second-order connection graph convolution, a novel graph convolutional network that dynamically learns specific motion features is constructed, in which the motion correlation between joints is defined dynamically. A dynamic weight loss function is also proposed for this network; it sets the weight of each joint according to its motion state in order to reduce the error of joints undergoing severe motion. Experiments show that the method dynamically constructs latent motion correlations between joints for different input data, which reduces depth ambiguity across different poses and makes the estimation of special poses more robust.

(3) To address the lack of special pose datasets for monocular single-person 3D pose estimation and the resulting difficulty of applying it to specific scenes, a multi-view camera acquisition system is constructed to collect a 3D human pose dataset containing the poses of different roles in traffic scenes, and a framework for recognizing the actions of those roles is proposed. The constructed dataset contains 3D pose annotations and the corresponding action categories for traffic police, pedestrians and cyclists. The proposed action recognition framework is based on the 3D human pose: it takes monocular video as input and obtains the action classification of the target through a pipeline of three modules, namely object detection, 2D/3D pose estimation, and action recognition. The object detection module and the 3D human pose estimation module are improved, and a dynamic adaptive graph convolutional network is proposed as the action recognition module. Experimental results demonstrate the feasibility of the proposed framework and show that the dataset addresses the lack of special pose data for 3D human pose estimation in specific scenarios.

(4) To address occlusion in monocular multi-person 3D pose estimation, an attention-based spatial interaction information integration network is proposed. It predicts and compensates for occluded joints by using the interaction information between the two people involved in strongly interactive motion. As the basis for this method, a million-scale strong-interaction human pose dataset with 2D/3D human pose annotations is collected. Within the network, a cross-spatial attention encoding module models the interaction information between the poses of the interacting people at the same moment. From this interaction information, the pose of the current target can be inferred from the pose of the other person, and the predicted joint information alleviates the severe occlusion that occurs in multi-person strongly interactive motion. Experiments show that modeling the interaction information makes it possible to predict and compensate for occluded joints in interactive motion and to obtain more accurate pose estimates.
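
The second-order neighbor adjacency matrix of work (1) and the motion-dependent loss weighting of work (2) can be illustrated with a small sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the five-joint chain skeleton, the function names, the row normalization, and the rule "weight = 1 + speed / mean speed" are hypothetical and are not taken from the dissertation. The code uses only the Python standard library.

```python
import math

# Hypothetical 5-joint chain (NOT the dissertation's skeleton):
# 0 pelvis - 1 spine - 2 shoulder - 3 elbow - 4 wrist
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
NUM_JOINTS = 5


def first_order_adjacency(edges, n):
    """Symmetric 0/1 adjacency matrix of the skeleton, without self-loops."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


def second_order_adjacency(a):
    """Connect pairs of joints that are exactly two bones apart."""
    n = len(a)
    a2 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or a[i][j]:
                continue  # skip the joint itself and its direct neighbors
            if any(a[i][k] and a[k][j] for k in range(n)):
                a2[i][j] = 1
    return a2


def row_normalize_with_self_loops(a):
    """Add self-loops, then divide each row by its sum (mean aggregation)."""
    n = len(a)
    out = []
    for i in range(n):
        row = [a[i][j] + (1 if i == j else 0) for j in range(n)]
        total = sum(row)
        out.append([v / total for v in row])
    return out


def graph_conv(a_norm, x, w):
    """One graph convolution: aggregate neighbor features, then project them."""
    n, c_in, c_out = len(x), len(w), len(w[0])
    agg = [[sum(a_norm[i][k] * x[k][c] for k in range(n)) for c in range(c_in)]
           for i in range(n)]
    return [[sum(agg[i][c] * w[c][o] for c in range(c_in)) for o in range(c_out)]
            for i in range(n)]


def motion_weighted_mpjpe(pred, target, speed):
    """Weighted mean per-joint position error.

    Assumed weighting rule: weight = 1 + speed / mean speed, so faster joints
    contribute more to the loss. The dissertation's actual rule is not stated.
    """
    mean_speed = sum(speed) / len(speed)
    weights = [1.0 + s / mean_speed if mean_speed > 0 else 1.0 for s in speed]
    errors = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


# Usage: build both graphs and run one layer on a toy 2D pose.
A1 = first_order_adjacency(EDGES, NUM_JOINTS)
A2 = second_order_adjacency(A1)
pose_2d = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
features_1st = graph_conv(row_normalize_with_self_loops(A1), pose_2d, identity)
features_2nd = graph_conv(row_normalize_with_self_loops(A2), pose_2d, identity)
```

In this sketch the second-order graph lets a joint such as the wrist exchange information directly with the shoulder, which is one plausible reading of why second-order connections help for fast-moving limb joints. In a trained network the projection matrix `w` would be learned rather than fixed to the identity.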

Keywords/Search Tags:

3D human pose estimation, graph convolutional networks, action recognition, 3D human pose datasets, attention mechanism

Related items

1	Research On Human Pose Action Recognition Oriented To Human-Machine Collaboration System
2	Research On Video Human Action Recognition Based On Pose Sequence
3	Research On 3D Human Pose Estimation Based On Attention Mechanism
4	A Research Of Pose Estimation And Action Recognition In Action Digitization
5	Study On Human Pose Estimation, Tracking And Human Action Recognition In Videos
6	Research On Human Action Recognition Based On Skeleton Features
7	Research And Implementation Of Human Behavior Recognition System Based On GCN
8	Human Pose Estimation And Action Recognition Using Deep Neural Networks
9	Research On 3D Human Pose Estimation Based On Spatiotemporal Semantic Graph Attention
10	Research On Human Pose Estimation Based On Convolutional Neural Network