Human pose estimation is a fundamental problem in the computer vision community, and it forms the cornerstone of a series of downstream tasks such as action recognition, human parsing, and pose tracking. It aims to detect and localize the body joints of all people in the input data. Video-based human pose estimation supports numerous applications, including security, violence detection, human-machine interaction, and augmented reality. However, frequent pose occlusion and motion blur in videos, as well as the time-consuming and labor-intensive manual annotation, dramatically increase the complexity of this task. Currently, many works focus on human pose estimation in static images. These approaches inherently have difficulty leveraging temporal context across video frames and rely heavily on the visual features of the current frame. Consequently, they usually fail in scenes with pose occlusion and motion blur, which leads to inaccurate keypoint detection. On the other hand, existing methods are generally trained on densely labeled datasets, neglecting the cost of collecting and annotating video data. This paper emphasizes temporal consistency in video-based human pose estimation, focusing on the following two methods:

(1) A deep dual consecutive network for human pose estimation. We employ consecutive video frames from dual temporal directions as supporting frames and extract temporal information to improve pose estimation of the current frame. In particular, we design three components to implement the network. A Pose Temporal Merger encodes keypoint spatiotemporal context to generate effective search scopes. A Pose Residual Fusion module computes motion cues over the short term. Finally, a Pose Correction Network comprising multi-granularity deformable convolutions is proposed to resample keypoint heatmaps within the localized search scopes.

(2) A multi-stream inference network for human pose estimation. To address the problem that existing approaches rely heavily on the visual cues of the current frame, we design a novel multi-stream inference network. The network incorporates bi-directional pose forecasts that are independent of the current frame's visual features, providing a strong complement to the visual detection results. Furthermore, considering the difficulty and high cost of labeling video datasets, we extend the network to sparsely labeled video scenes (pose annotations are given every N frames). The extended network can accurately predict the pose sequences of an entire video from a few annotated frames at test time, thus simplifying the annotation process.

Experimental results demonstrate that exploiting temporal information effectively improves the accuracy of keypoint detection in videos, significantly outperforming existing state-of-the-art pose estimation methods on multiple benchmark datasets. Additionally, when applying the proposed method to sparsely labeled video scenes, we still achieve remarkable results at large temporal intervals.
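To make the three components of method (1) concrete, the following is a minimal PyTorch sketch of how they could fit together. It is illustrative only, not the released implementation: the class names (PoseTemporalMerger, PoseResidualFusion, PoseCorrectionNetwork), the heatmap shapes, and the use of a single deformable-convolution granularity (the method uses multiple granularities) are all simplifying assumptions.

    # Hypothetical sketch of the dual-consecutive pipeline; names are
    # illustrative assumptions. Assumes per-frame keypoint heatmaps of
    # shape (B, K, H, W) produced by any backbone detector.
    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class PoseTemporalMerger(nn.Module):
        """Merge heatmaps from frames t-1, t, t+1 into a localized search scope."""
        def __init__(self, num_joints):
            super().__init__()
            self.fuse = nn.Conv2d(3 * num_joints, num_joints, 3, padding=1)

        def forward(self, h_prev, h_curr, h_next):
            return self.fuse(torch.cat([h_prev, h_curr, h_next], dim=1))

    class PoseResidualFusion(nn.Module):
        """Short-term motion cues from heatmap residuals in both directions."""
        def __init__(self, num_joints):
            super().__init__()
            self.fuse = nn.Conv2d(2 * num_joints, num_joints, 3, padding=1)

        def forward(self, h_prev, h_curr, h_next):
            return self.fuse(torch.cat([h_curr - h_prev, h_next - h_curr], dim=1))

    class PoseCorrectionNetwork(nn.Module):
        """Resample heatmaps inside the search scope with a deformable
        convolution whose offsets are predicted from the motion cues
        (single granularity here, for brevity)."""
        def __init__(self, num_joints, kernel_size=3):
            super().__init__()
            pad = kernel_size // 2
            self.offset = nn.Conv2d(num_joints, 2 * kernel_size * kernel_size,
                                    kernel_size, padding=pad)
            self.dcn = DeformConv2d(num_joints, num_joints, kernel_size,
                                    padding=pad)

        def forward(self, scope, motion):
            return self.dcn(scope, self.offset(motion))

    # Usage: refine the current frame's heatmaps using its two neighbours.
    K = 17                                              # e.g. COCO keypoints
    h = [torch.randn(2, K, 96, 72) for _ in range(3)]   # heatmaps for t-1, t, t+1
    scope = PoseTemporalMerger(K)(*h)
    motion = PoseResidualFusion(K)(*h)
    refined = PoseCorrectionNetwork(K)(scope, motion)   # (2, 17, 96, 72)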
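Similarly, the multi-stream idea in method (2) can be sketched under stated assumptions: a visual stream detects poses from the current frame, while forward and backward forecasting streams predict the same poses purely from neighbouring frames' heatmaps, and a learned fusion combines the three. All names here (PoseForecaster, MultiStreamFusion, window) are hypothetical.

    # Illustrative sketch of multi-stream inference; not the paper's code.
    import torch
    import torch.nn as nn

    class PoseForecaster(nn.Module):
        """Predict frame-t heatmaps from a window of neighbouring heatmaps,
        i.e. without looking at frame t's pixels."""
        def __init__(self, num_joints, window=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(window * num_joints, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, num_joints, 3, padding=1))

        def forward(self, neighbour_heatmaps):          # list of (B, K, H, W)
            return self.net(torch.cat(neighbour_heatmaps, dim=1))

    class MultiStreamFusion(nn.Module):
        """Combine the visual, forward, and backward streams per joint."""
        def __init__(self, num_joints):
            super().__init__()
            self.fuse = nn.Conv2d(3 * num_joints, num_joints, 1)

        def forward(self, visual, fwd, bwd):
            return self.fuse(torch.cat([visual, fwd, bwd], dim=1))

    K = 17
    past   = [torch.randn(1, K, 96, 72) for _ in range(2)]   # frames t-2, t-1
    future = [torch.randn(1, K, 96, 72) for _ in range(2)]   # frames t+1, t+2
    visual = torch.randn(1, K, 96, 72)                       # detector on frame t
    fwd = PoseForecaster(K)(past)       # forecast t from the past
    bwd = PoseForecaster(K)(future)     # forecast t from the future
    out = MultiStreamFusion(K)(visual, fwd, bwd)

Because the two forecasting streams never read the current frame's pixels, they remain informative when that frame is occluded or blurred; in the sparsely labeled setting, the same streams would propagate poses between the annotated frames.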