With the increasing prevalence and widespread use of intelligent devices, there is a growing demand to accurately understand the semantic information conveyed by human body actions. Accurate body action recognition can significantly improve work efficiency and quality of life, especially in fields such as sports analysis, elderly monitoring, and virtual reality interaction. Human action recognition has therefore become an important and highly valued direction in computer vision. At the current stage of research, human action recognition methods based on skeleton data have attracted particular attention because of the rich spatial and temporal features they provide and their good adaptability to varying environments, and they have achieved significant research results. This paper focuses on the Spatial Temporal Graph Convolutional Network (ST-GCN) method for human action recognition, addresses several of its shortcomings, and proposes an improved ST-GCN method. The method was extensively tested on the NTU RGB+D 60 and NTU RGB+D 120 datasets, demonstrating the effectiveness of the proposed network.

This investigation covers the following aspects: (1) In ST-GCN, all feature channels receive equal attention: the importance of informative features is not selectively enhanced, nor is the attention on redundant features reduced, which can affect the accuracy and robustness of the model. To address this issue, this paper proposes a channel attention module that encodes feature channels by applying two pooling operations in the channel enhancement component, generating a channel feature vector. A fully connected layer then facilitates information exchange between channels and produces channel attention weights, enabling the network to focus on the relatively important channel information and extract more effective action features. (2) Because graph convolution is a local operation, it can only utilize short-term trajectories; it
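The channel attention mechanism described in (1) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: it assumes a CBAM-style design in which average pooling and max pooling each squeeze the temporal and joint dimensions into a per-channel descriptor, a shared two-layer MLP (the "fully connected layer" in the text, with hypothetical weights `w1`/`w2`) mixes channel information, and a sigmoid yields the attention weights that rescale each channel.

```python
import numpy as np

def channel_attention(x, w1, b1, w2, b2):
    """Channel-attention sketch for a skeleton feature map.

    x: feature map of shape (C, T, V) -- channels, frames, joints.
    Two pooling operations (average and max over the (T, V) plane)
    produce per-channel descriptors; a shared two-layer MLP exchanges
    information between channels; a sigmoid turns the result into
    per-channel weights in (0, 1) that rescale x.
    """
    avg = x.mean(axis=(1, 2))  # average-pooled channel descriptor, (C,)
    mx = x.max(axis=(1, 2))    # max-pooled channel descriptor, (C,)

    def mlp(v):
        # shared fully connected layers with a ReLU bottleneck
        return w2 @ np.maximum(w1 @ v + b1, 0.0) + b2

    weights = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # sigmoid, (C,)
    return x * weights[:, None, None]  # reweight each channel

# Toy example: 4 channels, 8 frames, 3 joints, channel reduction ratio 2.
rng = np.random.default_rng(0)
C, r = 4, 2
x = rng.standard_normal((C, 8, 3))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = channel_attention(x, w1, b1, w2, b2)
assert y.shape == x.shape
```

Because the weights lie strictly between 0 and 1, relatively important channels are preserved while redundant ones are attenuated, without changing the shape of the feature map.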
cannot directly model the long-term temporal information required to distinguish between different actions. A multi-scale temporal dilated graph convolution module is proposed to address this issue. The module applies dilated convolutions with different dilation rates to extract features from non-adjacent time frames, thereby capturing the contextual information of human actions. A set of sub-temporal graph convolutions then models local temporal information and outputs multi-scale temporal features, increasing the model's temporal receptive field.
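The multi-scale dilated temporal convolution in (2) can be illustrated with a small numpy sketch. This is a simplified stand-in, not the paper's module: each branch applies a 1-D convolution along the time axis with a different dilation rate, so a kernel of size k covers a temporal window of (k - 1) * d + 1 frames at dilation d, and the branches are concatenated to form the multi-scale output. The single shared kernel and the branch layout are illustrative assumptions.

```python
import numpy as np

def dilated_temporal_conv(x, kernel, dilation):
    """1-D dilated convolution along the time axis with 'same' padding.

    x: (C, T, V) feature map; kernel: (k,) weights shared across
    channels and joints for brevity. Dilation d samples frames d steps
    apart, so the receptive field grows to (k - 1) * d + 1 frames
    without adding parameters.
    """
    C, T, V = x.shape
    k = len(kernel)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i, w in enumerate(kernel):
        # each tap reads frames offset by i * dilation
        out += w * xp[:, i * dilation : i * dilation + T, :]
    return out

def multi_scale_temporal(x, kernel, dilations=(1, 2, 3)):
    """Concatenate branches with different dilation rates (a hypothetical
    stand-in for the module's sub-temporal convolutions)."""
    return np.concatenate(
        [dilated_temporal_conv(x, kernel, d) for d in dilations], axis=0
    )

x = np.random.default_rng(1).standard_normal((2, 16, 5))  # (C, T, V)
y = multi_scale_temporal(x, np.array([0.25, 0.5, 0.25]))
# three branches stacked on the channel axis: (6, 16, 5)
assert y.shape == (6, 16, 5)
```

With kernel size 3, the three branches see 3, 5, and 7 consecutive-frame windows respectively, which is how dilation lets the module reach non-adjacent frames and enlarge the temporal receptive field at constant cost.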