Human action recognition has long been an active research direction in computer vision, with a wide range of applications in autonomous vehicles, video surveillance, and sports analysis. Because skeleton data are robust to background clutter and viewpoint changes and incur a small computational cost, skeleton-based action recognition has attracted considerable research interest. In recent years, researchers have proposed graph convolutional methods that mine the intrinsic correlations between skeleton joints as the feature representation of skeleton motion, and good results have been achieved. However, two difficult problems remain: (1) due to the local nature of graph convolution operations, such methods are easily confused by actions that contain similar movement clips; (2) relying on skeleton information alone, it is difficult to distinguish actions whose motion is subtle, or whose motion patterns are essentially the same but whose interacting objects differ.

To address the first problem, a graph convolutional network with significant-movement identification is proposed. Specifically, this paper introduces a tri-attention module that is implemented in three steps. First, a dimension permuting unit enables the network to characterize skeleton sequences from three different dimensions, which helps mine discriminative features more fully. Then, a global statistical modeling unit measures the degree of movement variation by aggregating the first-order and second-order statistics of the global context along each dimension. Finally, a fusion unit integrates the information from the three dimensions into a 3D attention map that guides the graph convolutional network to focus on significant movement variations. Experimental results on two large datasets show that the proposed method effectively alleviates the confusion caused by similar movement clips.

To address the second problem, an appearance-semantic-guided graph convolutional network for skeleton-based action recognition is proposed. Since RGB images and skeleton information are highly complementary, the RGB modality is chosen to provide appearance semantic information. To keep the fusion network lightweight, the most active image is automatically selected and fused with the skeleton sequence. In addition, the lightweight convolutional neural network GhostNet is used to extract appearance features in the RGB branch, and the lightweight graph convolutional network Light-GCN is used to extract significant movement variations in the skeleton branch. Furthermore, an appearance-semantic-guided feature enhancement module is designed to integrate appearance semantic features and skeleton motion features effectively: the appearance semantic features of the RGB branch are used to refine the motion features of the skeleton branch, which makes the network pay more attention to the action-related body parts of the skeleton.

The combination of the tri-attention module and appearance-semantic-guided fusion effectively alleviates the two problems mentioned above, while remaining simple to train and computationally efficient. It has broad application prospects in fields such as human-computer interaction and motion capture.
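The following is a minimal sketch of how the three-step tri-attention idea could be realized in PyTorch, assuming skeleton features of shape (N, C, T, V) (batch, channels, frames, joints); the layer sizes, bottleneck MLPs, and multiplicative fusion are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TriAttention(nn.Module):
    """Sketch of a tri-attention module over a skeleton feature map
    of shape (N, C, T, V): batch, channels, frames, joints."""

    def __init__(self, channels, frames, joints, reduction=4):
        super().__init__()
        # One small bottleneck MLP per dimension; each takes the
        # concatenated first-order (mean) and second-order (std) statistics.
        self.fc_c = self._mlp(2 * channels, channels, reduction)
        self.fc_t = self._mlp(2 * frames, frames, reduction)
        self.fc_v = self._mlp(2 * joints, joints, reduction)

    @staticmethod
    def _mlp(in_dim, out_dim, reduction):
        hidden = max(out_dim // reduction, 1)
        return nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
            nn.Sigmoid(),
        )

    @staticmethod
    def _stats(x, dims):
        # Global first- and second-order statistics over the reduced dims.
        return torch.cat([x.mean(dim=dims), x.std(dim=dims)], dim=-1)

    def forward(self, x):                          # x: (N, C, T, V)
        # "Dimension permuting": describe the sequence from the channel,
        # temporal, and joint views by reducing over the other two dims.
        a_c = self.fc_c(self._stats(x, (2, 3)))    # (N, C)
        a_t = self.fc_t(self._stats(x, (1, 3)))    # (N, T)
        a_v = self.fc_v(self._stats(x, (1, 2)))    # (N, V)
        # Fuse the per-dimension attention vectors into a 3D attention map.
        att = a_c[:, :, None, None] * a_t[:, None, :, None] * a_v[:, None, None, :]
        return x * att
```

For example, with NTU-style inputs one could call `TriAttention(channels=64, frames=50, joints=25)` on a tensor of shape (8, 64, 50, 25); the output keeps the same shape, with features reweighted toward dimensions showing significant movement variations.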
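The paper states that the most active image is selected automatically, but the exact criterion is not given here; the sketch below assumes a simple motion-energy rule based on summed joint displacements, purely for illustration.

```python
import torch

def select_most_active_frame(skeleton, rgb_frames):
    """Pick the RGB frame where the skeleton shows the largest motion.

    skeleton:   (T, V, 3) joint coordinates over T frames
    rgb_frames: (T, H, W, 3) the temporally aligned RGB frames
    """
    # Frame-wise motion energy: summed joint displacement between frames.
    displacement = (skeleton[1:] - skeleton[:-1]).norm(dim=-1)  # (T-1, V)
    energy = displacement.sum(dim=-1)                           # (T-1,)
    idx = int(energy.argmax()) + 1   # frame with the largest movement
    return rgb_frames[idx], idx
```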
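Finally, a plausible reading of the appearance-semantic-guided feature enhancement is channel-wise gating of the skeleton feature map by a pooled RGB descriptor, sketched below; `rgb_feat` and `skel_feat` stand in for the outputs of the GhostNet and Light-GCN branches, and the gating, residual connection, and fusion head are assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn

class AppearanceGuidedEnhancement(nn.Module):
    """Sketch: a global appearance descriptor from the RGB branch gates the
    channels of the skeleton feature map so that action-relevant body parts
    are emphasised. Shapes and layer sizes are illustrative assumptions."""

    def __init__(self, rgb_dim, skel_channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(rgb_dim, skel_channels),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, skel_feat):
        # rgb_feat:  (N, rgb_dim)   pooled appearance features (assumed)
        # skel_feat: (N, C, T, V)   skeleton branch feature map (assumed)
        g = self.gate(rgb_feat)[:, :, None, None]   # (N, C, 1, 1)
        return skel_feat * g + skel_feat            # residual gating


class FusionHead(nn.Module):
    """Illustrative two-stream head: enhanced skeleton features and
    appearance features are pooled, concatenated, and classified."""

    def __init__(self, rgb_dim, skel_channels, num_classes):
        super().__init__()
        self.enhance = AppearanceGuidedEnhancement(rgb_dim, skel_channels)
        self.classifier = nn.Linear(rgb_dim + skel_channels, num_classes)

    def forward(self, rgb_feat, skel_feat):
        fused = self.enhance(rgb_feat, skel_feat)   # (N, C, T, V)
        skel_vec = fused.mean(dim=(2, 3))           # global average pooling
        return self.classifier(torch.cat([rgb_feat, skel_vec], dim=1))
```

The residual form `skel_feat * g + skel_feat` is one way to let appearance semantics correct the skeleton motion features without suppressing them entirely; other fusion choices are equally compatible with the description above.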