
Research On Deep Network Model Of Human Pose Estimation And Its Lightweight

Posted on: 2024-08-16  Degree: Master  Type: Thesis
Country: China  Candidate: K Wang  Full Text: PDF
GTID: 2558307124486304  Subject: Computer Science and Technology
Abstract/Summary:
Human pose estimation is one of the fundamental problems in computer vision, with a research history going back roughly twenty years. With the growth of GPU computing power and the availability of large-scale data, deep learning has become the mainstream approach to human pose estimation. Convolutional neural networks in particular unify human body structure modeling and feature extraction, enabling end-to-end optimization. Nevertheless, the task remains challenging under difficult lighting, occlusion, and truncation. Moreover, the limited computing power and hardware resources of terminal and embedded devices place stringent demands on model inference speed, making real-time keypoint localization and recognition difficult. This thesis therefore studies the trade-off between inference cost and accuracy from two perspectives: model structure and model lightweighting.

(1) To address the high inference cost and optimization difficulty caused by stacking Transformer encoders on top of a convolutional network, a Transformer multi-scale cross-attention learning method is proposed. Starting from an analysis of the redundancy in stacked Transformer encoders, the method reduces the number of encoders by introducing cross-attention modules. Repeated interaction among multi-resolution feature tokens then exchanges and fuses feature information across scales, improving the model's generalization ability and robustness. The overall architecture consists of a convolutional module for low-level feature extraction, a prediction module for keypoint heatmap tokens, a multi-scale cross-attention module, and a cross-attention fusion module. Experiments on benchmark datasets verify the effectiveness of multi-resolution cross-attention: the proposed method strengthens the encoder's learning of keypoint correlations and significantly reduces the model's inference cost without degrading performance.

(2) To address the limited computing and storage resources available when deploying high-performance models, a self-distillation strategy is introduced to compress high-performance models quickly, avoiding both the multi-stage training cost of traditional knowledge distillation and the auxiliary architectural modifications required by online knowledge distillation. To resolve the optimization conflict between the teacher model and the shallow student branches that arises when training with self-distillation, an adaptive pose self-distillation strategy is further proposed, in which a variable factor adjusts the balance between the target loss and the distillation loss. Experiments and in-depth analysis on multiple datasets verify the effectiveness of the adaptive pose self-distillation strategy and show that overall model performance also improves to a certain extent.
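The multi-scale cross-attention in (1) can be illustrated with a minimal single-head sketch in NumPy, where tokens from a high-resolution feature map attend to tokens from a low-resolution one. The token counts, embedding dimension, and function names below are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # queries: tokens from one scale, shape (n_q, d)
    # keys_values: tokens from another scale, shape (n_kv, d)
    # scaled dot-product attention across the two scales
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values  # fused features, shape (n_q, d)

# toy example: 16 high-resolution tokens attend to 4 low-resolution tokens
rng = np.random.default_rng(0)
hi = rng.normal(size=(16, 32))  # high-resolution branch tokens
lo = rng.normal(size=(4, 32))   # low-resolution branch tokens
fused = cross_attention(hi, lo, d_k=32)
print(fused.shape)  # (16, 32)
```

In a full model this exchange would run in both directions and be repeated several times, so that information at every resolution is refined by the others; here a single pass suffices to show the mechanism.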
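The adaptive self-distillation objective in (2) can be sketched as a weighted sum of a target loss (student heatmaps against ground truth) and a distillation loss (student heatmaps against the deeper teacher branch), balanced by a variable factor. The heatmap shapes, the use of mean squared error, and the function name are assumptions for illustration only.

```python
import numpy as np

def adaptive_self_distill_loss(student_hm, teacher_hm, gt_hm, alpha):
    # target loss: student heatmaps vs. ground-truth heatmaps
    task = np.mean((student_hm - gt_hm) ** 2)
    # distillation loss: shallow student vs. deeper teacher branch
    distill = np.mean((student_hm - teacher_hm) ** 2)
    # variable factor alpha trades off the two terms
    return (1.0 - alpha) * task + alpha * distill

# toy heatmaps: 17 keypoints on a 64x48 grid, constant values for clarity
gt = np.full((17, 64, 48), 2.0)
teacher = np.full((17, 64, 48), 1.0)
student = np.zeros((17, 64, 48))
print(adaptive_self_distill_loss(student, teacher, gt, alpha=0.5))  # 2.5
```

Here `alpha` would typically vary during training (for example, growing as the teacher branch becomes reliable), which is what makes the strategy adaptive rather than a fixed-weight distillation.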
Keywords/Search Tags: human pose estimation, cross attention, knowledge distillation, Transformer encoder, lightweight