Three-dimensional human pose estimation aims to estimate the 3D joint locations of human body parts. Owing to its fundamental role in many applications, such as action recognition, augmented reality, and robot training, it has long been an active research topic in computer vision. In recent years especially, with the continuous development of deep learning, many strong 3D human pose estimation algorithms have been proposed, yet many problems remain. For example, in monocular 3D pose estimation, limb joints (e.g., wrist, ankle) have higher degrees of freedom (DOF) than torso joints (e.g., hip, thorax). As a result, estimation errors accumulate along the physiological structure of the human body, and the trajectories of limb joints are more complex. In addition, 3D human pose estimation has developed into diverse frameworks according to factors such as the number of views, the length of the video sequence, and whether camera calibration is used. These frameworks are incompatible with one another, severely limiting the deployment flexibility of 3D pose estimation algorithms. In response to these problems, two pose estimation algorithms are proposed to improve the accuracy and compatibility of the model. The main contributions are summarized as follows:

To address the accumulation of limb errors in monocular pose estimation, a 3D pose estimation algorithm based on skeleton angles (Limb Net) is proposed. The model consists of a kinematic-constraint-aware network and a trajectory-aware temporal module. Two kinematic constraints, relative bone angles and absolute bone angles, are introduced in the kinematic-constraint-aware network: the former models the angular relation between adjacent bones, and the latter models the angular relation between bones and the camera plane (an illustrative sketch of these angle computations is given below). The trajectory-aware temporal module takes the temporal trajectories of joints as input and generates fused poses. Through the joint effect of the kinematic constraints and the trajectory network, the accumulation of errors along the limbs is alleviated. Experiments on four public datasets verify the effectiveness of Limb Net.

To address the poor model compatibility in multi-view pose estimation, a unified framework compatible with various camera configurations (MTF-Transformer) is proposed. MTF-Transformer can handle variable-length videos with different numbers of cameras and is compatible with both camera-calibrated and uncalibrated scenarios. It consists of a feature extractor, a Multi-view Fusion Transformer (MFT), and a Temporal Fusion Transformer (TFT). The feature extractor predicts 2D poses from images and encodes the 2D coordinates and confidences into feature vectors; MFT adaptively measures the implicit relationship between each pair of cameras and reconstructs the features through a relative-attention module; TFT aggregates variable-length temporal features and outputs 3D poses (a simplified fusion sketch follows below). With these modules, MTF-Transformer handles different application scenes, ranging from a single monocular image to multi-view video and from calibrated to uncalibrated scenes. MTF-Transformer is evaluated quantitatively and qualitatively on three public datasets, and the experiments show that the algorithm generalizes well to dynamic capture with an arbitrary number of unseen views.
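
For illustration, the following is a minimal sketch of how the two kinematic constraints of Limb Net could be computed from 3D joints in camera coordinates. The skeleton topology, helper names, and exact angle formulations here are assumptions chosen for exposition, not taken from the thesis.

```python
import numpy as np

# Hypothetical kinematic chain: each bone is a (parent_joint, child_joint)
# index pair, and adjacent bones share a joint along the chain.
BONES = [(0, 1), (1, 2), (2, 3)]          # e.g. a hip -> knee -> ankle chain
ADJACENT_BONE_PAIRS = [(0, 1), (1, 2)]    # pairs of bones that share a joint

def bone_vectors(joints_3d):
    """joints_3d: (J, 3) array of 3D joint positions in camera coordinates."""
    return np.stack([joints_3d[c] - joints_3d[p] for p, c in BONES])

def relative_bone_angles(joints_3d):
    """Relative bone angles: angle between each pair of adjacent bones."""
    vecs = bone_vectors(joints_3d)
    angles = []
    for i, j in ADJACENT_BONE_PAIRS:
        u, v = vecs[i], vecs[j]
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)

def absolute_bone_angles(joints_3d):
    """Absolute bone angles: angle between each bone and the camera plane.

    Assuming the optical axis is +z, the angle to the image plane is the
    complement of the angle to the z axis, i.e. arcsin(|z| / ||bone||).
    """
    vecs = bone_vectors(joints_3d)
    norms = np.linalg.norm(vecs, axis=1) + 1e-8
    return np.arcsin(np.abs(vecs[:, 2]) / norms)
```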
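
Likewise, the sketch below illustrates the general idea of fusing a variable number of views and a variable number of frames with attention. It uses standard multi-head attention in place of the relative-attention module of MFT, and the module names, feature dimensions, and pooling choices are assumptions; it is not the published MTF-Transformer.

```python
import torch
import torch.nn as nn

class ViewTemporalFusion(nn.Module):
    """Illustrative only: fuse per-view features, then per-frame features,
    with standard multi-head attention, for any number of views/frames."""

    def __init__(self, dim=256, heads=8, num_joints=17):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, feats):
        # feats: (B, T, V, C) - batch, frames, views, feature channels.
        B, T, V, C = feats.shape
        x = feats.reshape(B * T, V, C)
        x, _ = self.view_attn(x, x, x)         # fuse across views per frame
        x = x.mean(dim=1).reshape(B, T, C)     # pool the views
        x, _ = self.time_attn(x, x, x)         # fuse across frames
        x = x.mean(dim=1)                      # pool the frames
        return self.head(x).reshape(B, -1, 3)  # one 3D pose per clip

# Usage: works for any T (frames) and V (views), including T = V = 1.
feats = torch.randn(2, 9, 3, 256)              # 2 clips, 9 frames, 3 views
pose = ViewTemporalFusion()(feats)             # -> (2, 17, 3)
```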