Human pose estimation is one of the fundamental tasks in computer vision, aiming to locate all the keypoints of the human body in an image or video. It provides technical support for downstream tasks such as motion prediction, action recognition, and gesture recognition, and is widely used in motion capture, human-computer interaction, and surveillance. In practical applications, the accuracy of pose estimation is affected by numerous factors, such as complex environments, joint occlusions caused by human movement, mutual occlusions between people, and large variations in human scale within images, which make it difficult for a network to learn the features of the affected joints directly and thus degrade estimation accuracy. To obtain more accurate human pose estimation results, this thesis studies the problem from the perspective of multi-scale feature fusion. The main contributions are as follows.

MSRN, a multi-scale feature refinement network for human pose estimation based on the attention mechanism, is proposed to extract more effective multi-scale features, suppress redundant information during multi-scale feature fusion, and strengthen the representation of useful information. The network uses HRNet as its backbone and replaces the fusion layers in HRNet with the proposed multi-resolution attention module. It first generates multi-scale features through HRNet's parallel multi-resolution branches, and then uses the multi-resolution attention module to adaptively select the appropriate information to participate in fusion according to the relationships between features at different resolutions, avoiding information duplication and interference and thereby improving pose estimation accuracy. Experimental results on the MPII, COCO, CrowdPose, and OCHuman datasets show that MSRN refines the extracted multi-scale features and effectively improves pose estimation accuracy.

HRCNet, a multi-stage human pose estimation network, is proposed to extract semantically richer multi-scale features and capture more keypoint spatial information. It is a two-stage architecture that stacks HRNet with an encoder-decoder network and incorporates intermediate supervision, cross-stage connections, and skip connections. The network first uses HRNet to extract semantically rich multi-scale features and output keypoint heatmaps, then uses the encoder-decoder network to progressively fuse features at different scales; the keypoint heatmaps output by HRNet are passed directly to the encoder-decoder network through intermediate supervision, while cross-stage and skip connections propagate spatial information from shallow layers to deep layers, enriching the semantics of the multi-scale features and improving pose estimation accuracy. Experimental results on the MPII, COCO, CrowdPose, and OCHuman datasets show that HRCNet effectively complements keypoint spatial information and enriches the semantic information of multi-scale features, significantly improving the overall accuracy of pose estimation.
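As a rough illustration of the attention-gated multi-resolution fusion idea (a minimal sketch, not the thesis's actual module: the class name, channel counts, and the use of a squeeze-and-excitation style channel gate are all assumptions), one possible PyTorch formulation is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse parallel HRNet-style branches into one target
    resolution, weighting each resampled branch with a channel-attention gate
    so that redundant information is suppressed before summation."""

    def __init__(self, branch_channels, target_index, reduction=4):
        super().__init__()
        self.target_index = target_index
        out_ch = branch_channels[target_index]
        # 1x1 convolutions align every branch to the target channel count.
        self.align = nn.ModuleList(
            nn.Conv2d(c, out_ch, kernel_size=1, bias=False) for c in branch_channels
        )
        # Squeeze-and-excitation style gates produce per-channel weights in (0, 1).
        self.gates = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(out_ch, out_ch // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch // reduction, out_ch, kernel_size=1),
                nn.Sigmoid(),
            )
            for _ in branch_channels
        )

    def forward(self, features):
        # features: list of tensors, one per resolution branch.
        target_size = features[self.target_index].shape[-2:]
        fused = 0
        for feat, align, gate in zip(features, self.align, self.gates):
            x = align(feat)
            # Resample every branch to the target resolution before fusion.
            x = F.interpolate(x, size=target_size, mode="bilinear", align_corners=False)
            fused = fused + gate(x) * x
        return fused


if __name__ == "__main__":
    branches = [torch.randn(1, c, 64 // 2**i, 48 // 2**i) for i, c in enumerate([32, 64, 128])]
    fusion = MultiResolutionAttentionFusion([32, 64, 128], target_index=0)
    print(fusion(branches).shape)  # torch.Size([1, 32, 64, 48])
```

The gate here is only one plausible way to let the fusion step select information adaptively; the thesis's multi-resolution attention module may weight features differently.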
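Similarly, the two-stage structure with intermediate supervision can be sketched as follows (again an assumption-laden toy: the convolutional stages below are stand-ins for HRNet and the encoder-decoder network, and the joint count and channel widths are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStagePoseNet(nn.Module):
    """Hypothetical two-stage skeleton: stage 1 predicts intermediate heatmaps
    (supervised directly), and stage 2 refines them using both the stage-1
    features and the intermediate heatmaps, with a skip connection carrying
    shallow spatial information into the deeper stage."""

    def __init__(self, num_joints=17, feat_ch=32):
        super().__init__()
        # Stand-in for the HRNet stage described in the thesis.
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head1 = nn.Conv2d(feat_ch, num_joints, 1)
        # Stand-in for the encoder-decoder refinement stage.
        self.stage2 = nn.Sequential(
            nn.Conv2d(feat_ch + num_joints, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head2 = nn.Conv2d(feat_ch, num_joints, 1)

    def forward(self, x):
        feat1 = self.stage1(x)
        heat1 = self.head1(feat1)  # intermediate supervision target
        # Cross-stage connection: stage-1 features and heatmaps feed stage 2;
        # the residual addition acts as a skip connection from shallow features.
        feat2 = self.stage2(torch.cat([feat1, heat1], dim=1)) + feat1
        heat2 = self.head2(feat2)
        return heat1, heat2


def pose_loss(pred, target):
    # Both stages are supervised against the same ground-truth heatmaps.
    heat1, heat2 = pred
    return F.mse_loss(heat1, target) + F.mse_loss(heat2, target)


if __name__ == "__main__":
    net = TwoStagePoseNet()
    img = torch.randn(2, 3, 64, 48)
    target = torch.randn(2, 17, 64, 48)
    loss = pose_loss(net(img), target)
    loss.backward()
    print(loss.item())
```

Supervising both heatmap outputs is what makes the intermediate stage's predictions usable as an explicit input to the refinement stage, which mirrors the role of intermediate supervision described above, even though the real HRCNet stages are far deeper.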