Action recognition has been a research hotspot in computer vision owing to its wide applications in human-computer interaction, intelligent surveillance, and video understanding. Compared with RGB video, skeleton data are more robust to changes in illumination, environment, and camera viewpoint, and graph convolutional networks can model the topology of the human skeleton effectively. As a result, skeleton-based human action recognition with graph convolutional networks has drawn significant attention from researchers. Although existing studies have achieved notable results, problems remain, such as insufficient exploration of the interactions between joints that are not physically connected, high model inference cost, and difficulty in distinguishing different actions with similar motion trajectories. To address these problems, this thesis proposes three graph convolutional network models for skeleton data. The primary research contents are as follows:

(1) An adaptive activation graph convolutional network is proposed to explore the interactions between non-physically connected joints more effectively and to reduce temporally redundant information. Firstly, the similarity between joints in an embedding space is computed and used as the weights of the edges connecting nodes, so that the spatial topology of the skeleton is learned adaptively. Secondly, richer spatio-temporal features are extracted using class activation maps and a multi-stream network architecture. Finally, a temporal feature aggregation module is introduced to reduce temporal redundancy by using dilated convolutions to aggregate frame-level features across skipped frames. The proposed method outperforms the classical two-stream adaptive graph convolutional network on two skeleton action recognition datasets, NTU RGB+D and NTU RGB+D 120; on the Cross-Subject and Cross-View benchmarks of NTU RGB+D, the adaptive activation graph convolutional network reaches recognition accuracies of 88.9% and 94.5%, respectively. The experimental results show that the proposed method is effective for skeleton-based human action recognition.

(2) To make effective use of the spatio-temporal semantic information latent in human skeleton sequences and to extract multi-scale features, a semantically guided multiscale neural network is proposed. Firstly, to enhance the representation of human motion features, joint-type and frame-index semantic information are embedded into the spatial and temporal modeling, respectively. Secondly, based on the original skeleton structure, multi-scale skeleton information that preserves the dependencies of the original skeleton is obtained by aggregating neighboring joints and is modeled with an adaptive graph convolutional network to extract spatial multi-scale features. Finally, by grouping the neurons of the temporal convolution network and applying dilated convolutions with different dilation rates, a multi-scale temporal convolutional network is constructed to extract temporal multi-scale features. Experiments are conducted on the NTU RGB+D and NTU RGB+D 120 skeleton action recognition datasets; on the Cross-Subject and Cross-View benchmarks of NTU RGB+D, the semantically guided multiscale neural network achieves recognition accuracies of 90.1% and 95.8% with only 0.93 M parameters, respectively. The results show that the model improves recognition accuracy while reducing computational cost.
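The operation shared by contributions (1) and (2) — learning edge weights from the embedding-space similarity between joints — can be illustrated with a minimal PyTorch sketch. The module name, embedding dimension, time pooling, and softmax normalization below are assumptions chosen for illustration; they follow the common two-stream adaptive graph convolution formulation rather than the exact implementation described in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    """Graph convolution whose adjacency is learned from joint similarity
    in an embedding space (illustrative sketch, not the thesis code)."""

    def __init__(self, in_channels, out_channels, num_joints, embed_channels=16):
        super().__init__()
        # 1x1 convolutions that embed each joint's features before comparison
        self.theta = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        self.out = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # learnable bias adjacency shared across samples (assumed residual topology term)
        self.B = nn.Parameter(torch.zeros(num_joints, num_joints))

    def forward(self, x):
        # x: (N, C, T, V) = batch, channels, frames, joints
        q = self.theta(x).mean(dim=2)              # (N, Ce, V), pooled over time
        k = self.phi(x).mean(dim=2)                # (N, Ce, V)
        sim = torch.einsum('ncv,ncw->nvw', q, k)   # pairwise joint similarity (N, V, V)
        A = F.softmax(sim, dim=-1) + self.B        # data-dependent + learnable edges
        # aggregate joint features along the learned edges
        y = torch.einsum('nctv,nvw->nctw', x, A)
        return self.out(y)
```

In the full networks, such learned edges would typically be combined with the physical skeleton adjacency and stacked with temporal convolutions; those details are omitted here for brevity.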
(3) To address the difficulty of distinguishing different actions with similar motion trajectories, a topologically refined graph convolutional network based on multi-order features is proposed. Firstly, since the angles formed between joints during human motion are distinctive, angular features are encoded into the joint, bone, and motion information to improve the model's ability to distinguish actions with similar motion trajectories without additional training cost. Secondly, the joint, bone, and motion information with embedded angular features is modeled by topologically refined graph convolutional networks to extract complementary spatio-temporal features. Finally, a spatio-temporal information sliding extraction module is designed to enhance the correlation of higher-order spatio-temporal features. The multi-stream network consisting of joint, bone, and motion branches is evaluated on three skeleton action recognition datasets, NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA; on the Cross-Subject and Cross-View benchmarks of NTU RGB+D, the recognition accuracy reaches 92.8% and 97.0%, respectively. The experimental results demonstrate the superiority of the method.
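As a rough illustration of how the angular cues in contribution (3) can be derived from raw joint coordinates, the sketch below computes, for selected joints, the cosine of the angle spanned by two incident bones. The chosen bone triples and the way such features would be fused with the joint, bone, and motion streams are assumptions made for the example; the thesis's exact angular encoding may differ.

```python
import torch

def joint_angle_features(joints, bone_triples, eps=1e-6):
    """Cosine of the angle at each listed joint, formed by two incident bones.

    joints:       (N, T, V, 3) xyz coordinates per clip, frame, and joint
    bone_triples: list of (center, end_a, end_b) joint-index triples
    returns:      (N, T, len(bone_triples)) angular features
    """
    feats = []
    for center, a, b in bone_triples:
        u = joints[..., a, :] - joints[..., center, :]   # bone vector center -> a
        v = joints[..., b, :] - joints[..., center, :]   # bone vector center -> b
        cos = (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1) + eps)
        feats.append(cos)
    return torch.stack(feats, dim=-1)

# Example: elbow angles; the joint indices below are assumed for illustration only.
elbow_triples = [(5, 4, 6), (9, 8, 10)]
x = torch.randn(2, 64, 25, 3)                    # dummy batch: 2 clips, 64 frames, 25 joints
angles = joint_angle_features(x, elbow_triples)  # (2, 64, 2)
```

Because such angles depend only on relative joint positions, appending them to each stream adds discriminative cues for trajectory-similar actions at negligible extra cost, which is the motivation stated for contribution (3).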