Font Size: a A A

Research On 3D Human Pose Estimation Method On Monocular Video

Posted on:2021-07-23Degree:MasterType:Thesis
Country:ChinaCandidate:W N WangFull Text:PDF
GTID:2518306461458404Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
3D human pose estimation is a task to analyze the motion characteristics of human body in 2D images and transform them into the motion of human body in the corresponding 3D space.It is widely used in the fields of human motion analysis,behavior recognition and motion capture.However,due to the high dimensional characteristics of human body and the high complexity of human motion,3D human pose estimation is very complex,meanwhile,the external environment factors,and motion occlusion increase the difficulty of 3D human pose estimation.The final research purpose of 3D pose estimation is to build an effective model or algorithm to realize precise estimation of the human body.The state-of-the-art works solve the problem of human intraclass variability within class,non-rigid deformation,complex scenes and perspective transformation and so on.However,the application in the field of industrial and commercial needs higher accuracy for 3D pose estimation.On the basis of the existing work,how to further improve the accuracy in the case of complex background,occlusion and human deformation is the focus of the research on 3D human pose estimation.Starting from the monocular video based 3D human pose estimation,this paper aims to improve the accuracy of 3D estimation algorithm with the help of time information.The specific research contents and innovations are as follows:(1)Occlusion has a particularly prominent impact on the accuracy of many pose estimation tasks.In the case of severe occlusion,obtaining a reasonable 3D initial pose and ensuring its wholeness can improve the accuracy of the overall estimation.In order to estimate the 3D human pose with given 2D joint points for a monocular image sequence,the paper improve a method based on sparse representation.We use a convex relaxation method based on l1/2 regularization to solve the sparse representation model fused with the 3D deformable model to obtain the reasonable initial value for each single frame image.Compared with l1 regularization,the solution using l1/2 regularization is sparser and can present more reasonable initial pose for each image.At the same time,based on a learned base pose dictionary,the sparse model can effectively solve the regression problem of joint in the case of occlusion,and improve the effect of single frame estimation.On this basis,a penalty term based on geometric priori is introduced into the sparse model to make the 3D parameters between adjacent two frames not change too much and ensure the continuity of the video action.The qualitative and quantitative analysis on Human3.6M dataset shows that the estimation effect is better than the current excellent method,which proves the effectiveness of the proposed method.It is worth mentioning that the method based on sparse representation can effectively deal with the occlusion problem,which lays a foundation for the following work in this paper.(2)The error of 2D pose estimation is the main cause of the error of 3D pose estimation.How to map from 2D pose to the optimal and most reasonable 3D pose under the interference of 2D error or noise is the key to improve the accuracy of 3D human pose estimation.We proposes the Sparse Representation(SR)and Multi-channel Long Short Term Memory(MLSTM)denoising en-decoder(SR-MLSTM)based on the advantages of sparse representation and deep learning.Through the estimation method fusing sparse representation and deep model,the expressive spatial geometric pose is combined with the time model.Firstly,the sparse model is used to estimate the 3D initial value of a single frame to solve the ambiguity of 3D estimation and obtain a preliminary 3D spatial structure.Secondly,the imprecise 3D initial value is considered to be with noise.Different from the other method,the obtained single-frame 3D estimation was input into the proposed MLSTM denoising en-decoder in the form of sequences,and the 3D initial value was optimized by using the time-dependent relationship of the human pose between adjacent frames in the video and the time smoothing constraint.Experiments were performed on Human3.6M dataset in comparison with existing work.For two kinds of input data:2D coordinates given by the dataset and 2D estimated coordinates obtained by convolutional neural network,the proposed MLSTM denoising en-decoder based on time information in this paper shows a high precision estimation effect.In addition,it can also provide a reasonable 3D pose if the 2D pose estimation is not accurate.
Keywords/Search Tags:3D human pose estimation, Sparse representation, LSTM, Residual connection
PDF Full Text Request
Related items