| Human keypoint detection, a fundamental task in computer vision, aims to locate the keypoints of a person, including but not limited to the nose, eyes, and ears. The task supports numerous downstream applications, such as virtual reality, pose capture for film, and sports and fitness. Many current state-of-the-art algorithms pursue accuracy at the cost of highly complex network models. This complexity incurs significant computational and time overhead during inference, making smooth detection difficult to achieve in practical scenarios. Many researchers have acknowledged this issue and proposed solutions, but most of these focus on compressing models in terms of parameter count and computational complexity. Despite these efforts, redundancy in the model structure still makes inference time-consuming in practice. To address these challenges, this paper places particular emphasis on computational cost and inference time. It investigates lightweight models for human keypoint detection and designs network architectures that strike a better balance between model complexity and detection accuracy. The main contributions of this paper are summarized as follows. (1) This paper studies in depth the characteristics of the vision Transformer architecture and of lightweight convolutional networks, and proposes a lightweight module called CSA-Block. The module first divides the input features into two groups. One group is sent to a self-attention network to learn the interdependence between keypoints; alternating global self-attention and window self-attention at different depths of the network better balances network complexity and detection accuracy. The other group is sent to a lightweight separable convolutional network to extract local edge and texture information. Then these two sets of 
features are concatenated along the channel dimension and sent to a channel attention network, which amplifies important feature channels and suppresses unimportant ones, thereby enhancing the expressive power of the features. This paper builds a lightweight high-resolution network with multi-branch fusion and embeds CSA-Block into each branch of the network, which achieves good results. (2) This paper constructs a complete distillation system that employs feature-based and response-based knowledge transfer to train the student network. The idea of reparameterization is also introduced into the student model: by reorganizing the parameters of the trained network, the network is made more lightweight while its detection accuracy remains unchanged. After multiple training cycles, a satisfactory distillation effect is obtained. (3) Using the trained lightweight human keypoint network model, this paper constructs a motion analysis system. Its functions include human keypoint detection, motion counting, and action similarity analysis. Tests of the system's function modules and response show that the system can detect human keypoints and analyze the related motions while maintaining smooth video output. |
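
The combination of feature-based and response-based knowledge transfer in contribution (2) can be illustrated with a minimal sketch. The abstract does not state the loss functions or weights, so the mean squared error terms, the weights `alpha` and `beta`, and all function names below are assumptions made for illustration only. The sketch also assumes that student and teacher features already share the same shape, which in practice usually requires an adapter layer.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def distillation_loss(student_heatmap, teacher_heatmap, target_heatmap,
                      student_feature, teacher_feature,
                      alpha=0.5, beta=0.1):
    """Task loss plus response-based and feature-based distillation terms.

    alpha and beta are hypothetical weights, not values from the paper.
    """
    # Supervised term: student prediction against the ground-truth heatmap.
    task = mse(student_heatmap, target_heatmap)
    # Response-based term: student output mimics the teacher output.
    response = mse(student_heatmap, teacher_heatmap)
    # Feature-based term: student intermediate features mimic the teacher's.
    feature = mse(student_feature, teacher_feature)
    return task + alpha * response + beta * feature


# Toy flattened heatmaps and features (hypothetical values).
loss = distillation_loss(
    student_heatmap=[0.0, 1.0],
    teacher_heatmap=[0.0, 0.5],
    target_heatmap=[0.0, 1.0],
    student_feature=[1.0, 1.0],
    teacher_feature=[1.0, 3.0],
)
```

With these toy values the task term is zero, so the loss consists only of the two weighted distillation terms.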
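
The claim that reparameterization makes the network lighter without changing its accuracy can be illustrated with a standard example: folding an inference-mode batch normalization layer into the preceding convolution. The abstract does not say which reparameterization scheme the student model uses, so this is a sketch of the general idea under that assumption, reduced to a 1x1 convolution applied to a single pixel; all names and parameter values are hypothetical.

```python
import math


def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to one pixel; x holds the input-channel values."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization using stored running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm scale and shift into the convolution parameters."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, s in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)
        fused_w.append([w * scale for w in row])
        fused_b.append((b - m) * scale + bt)
    return fused_w, fused_b


# Hypothetical toy parameters: 3 input channels, 2 output channels.
weight = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]

x = [1.0, 2.0, -3.0]
reference = batch_norm(conv1x1(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, fused_w, fused_b)

# The two paths agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(reference, fused))
```

After fusion, the single convolution replaces two layers while producing the same output, which is why accuracy is unaffected.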
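
The motion counting and action similarity functions in contribution (3) could be built on detected keypoints roughly as follows. The abstract does not describe the algorithms used, so the joint-angle thresholds, the hysteresis counter, and the cosine similarity over normalized keypoints are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at keypoint b between segments b->a and b->c.

    Assumes a and c are both distinct from b.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def count_repetitions(angles, low=90.0, high=160.0):
    """Count bend-then-extend cycles using two thresholds (hysteresis)."""
    count, is_down = 0, False
    for angle in angles:
        if not is_down and angle < low:
            is_down = True
        elif is_down and angle > high:
            is_down = False
            count += 1
    return count


def pose_similarity(pose_a, pose_b):
    """Cosine similarity of two poses after centering and scale normalization.

    Assumes each pose contains at least two distinct keypoints.
    """
    def normalize(pose):
        cx = sum(x for x, _ in pose) / len(pose)
        cy = sum(y for _, y in pose) / len(pose)
        flat = [v for x, y in pose for v in (x - cx, y - cy)]
        scale = math.sqrt(sum(v * v for v in flat))
        return [v / scale for v in flat]

    va, vb = normalize(pose_a), normalize(pose_b)
    return sum(p * q for p, q in zip(va, vb))


# Per-frame knee angles for a hypothetical squat sequence.
knee_angles = [170, 150, 100, 80, 95, 165, 170, 85, 120, 170]
reps = count_repetitions(knee_angles)

# The same pose, translated and scaled, should score close to 1.
pose = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
moved = [(2 * x + 5, 2 * y - 3) for x, y in pose]
similarity = pose_similarity(pose, moved)
```

Using two thresholds instead of one prevents jitter near a single threshold from being counted as extra repetitions, and normalizing the poses makes the similarity score independent of where the person stands and how large they appear in the frame.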