Action recognition techniques based on skeleton data are receiving increasing attention in computer vision because they adapt well to dynamic environments and complex backgrounds. Modelling human skeleton data as spatial-temporal graphs and processing them with graph convolutional networks (GCNs) has been shown to produce good recognition results. However, existing GCN methods typically use fixed-size convolutional kernels to extract temporal features, which may be poorly suited to multi-level model structures and thus limit recognition accuracy. Moreover, fusing the streams of a multi-stream network at equal scale ignores the differences in recognition ability among streams, which affects the final recognition results. Based on the above analysis, the contributions of this paper are as follows.

(1) To match the multi-level structure of the network, a multi-scale dilated temporal graph convolutional layer (MDTGCL) is first proposed. The MDTGCL uses multiple convolution kernels together with dilated convolution to better fit the multi-level structure of the GCN model and to capture longer-range contextual spatial-temporal information, thereby extracting richer behavioural features.

(2) Because higher-order information derived from skeleton data (e.g. the length and orientation of bones) is highly discriminative and therefore beneficial for human action recognition, a multi-stream multi-scale dilated spatial-temporal graph convolutional network (MSMDSTGCN) model is designed that jointly models the spatial information of joints and bones, their multiple motion features, and the angle information of bones.

(3) To account for the differing discriminative power of the individual streams of the multi-stream network, a multi-stream feature fusion (MFF) structure is proposed. The MFF structure performs a weighted fusion of the Softmax scores output by the streams to obtain the final recognition result.

(4) Attention mechanisms have been widely used in recent years and can improve the recognition performance of a model. This paper proposes a transferable attention module, the spatial-temporal channel attention enhancement module (STCAEM), consisting of three sub-modules: the spatial attention enhancement module (SAEM), the temporal attention enhancement module (TAEM), and the channel attention enhancement module (CAEM), which apply attention in the spatial, temporal, and channel dimensions respectively.

Combining the above proposals yields the attention-enhanced multi-stream feature fusion multi-stream multi-scale dilated spatial-temporal graph convolutional network (2M-ASTGCN) model. Extensive experiments on two large public datasets (NTU RGB+D 60 and Kinetics Skeleton 400) show that the proposed model achieves state-of-the-art performance.
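The weighted Softmax-score fusion performed by the MFF structure can be sketched as follows. This is a minimal illustration only: the three streams, their class scores, and the fusion weights shown here are hypothetical examples, not values from the paper, and the actual MFF weights would be chosen to reflect each stream's measured recognition ability.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mff_fuse(stream_logits, weights):
    """Sketch of MFF-style fusion: per-stream Softmax scores are
    combined by a weighted average, and the fused score vector is
    used for the final class decision.

    stream_logits: list of 1-D arrays, one logit vector per stream.
    weights: per-stream fusion weights (hypothetical values here).
    """
    scores = np.stack([softmax(l) for l in stream_logits])
    fused = np.average(scores, axis=0, weights=weights)
    return fused, int(np.argmax(fused))

# Hypothetical 3-class outputs from joint, bone, and motion streams.
joint = np.array([2.0, 0.5, 0.1])
bone = np.array([1.5, 1.0, 0.2])
motion = np.array([0.3, 2.2, 0.4])

# Streams with stronger recognition ability receive larger weights.
fused, pred = mff_fuse([joint, bone, motion], weights=[0.4, 0.4, 0.2])
```

Because the fused vector is a convex combination of probability distributions, it remains a valid distribution, so the argmax can be read directly as the predicted class.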