
The Study Of Graph Convolutional Network For Human Action Recognition

Posted on: 2022-08-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J N Li
Full Text: PDF
GTID: 1488306605989039
Subject: Circuits and Systems
Abstract/Summary:
The task of action recognition aims to automatically understand and recognize human-oriented video signals, and it is one of the most popular research directions in computer vision. A human action in a video generates a space-time shape in the 3D volume, so extracting discriminative features with high representational power and strong robustness is the key problem of the action recognition task. Recently, with the development of deep learning, deep neural networks have achieved impressive performance on action recognition. In particular, methods based on graph convolutional networks (GCNs) have obtained remarkable results; their flexibility in semantic modeling has brought a new wave of development to action recognition. However, existing GCN-based methods still leave several problems to be solved.

First, for spatial information modeling, existing GCN-based methods construct a skeleton graph based on the natural structure of the human body. These methods overlook the direct correlation between interacting parts. Especially for two-person interaction recognition, the direct correlations between the attended joints of different persons are largely ignored. How to use human knowledge to help the network focus on the interaction between key parts, without interference from irrelevant information, is a problem worth studying.

Second, in temporal information modeling, these methods only stack multi-layer 1D local convolutions to model motion dynamics, which is an inefficient way to capture temporal dynamics. Note that an action in a video is mainly characterized by its temporal dynamics rather than its static appearance. Therefore, how to fully explore the temporal dynamics of a skeleton sequence remains to be solved.

Finally, the extraction of spatiotemporal features is explored using multiple modalities, including the skeleton and RGB modalities. Skeleton data is an intrinsic high-level representation of the human body; in contrast, the RGB modality is a detailed, low-level representation of the image. No single modality can represent an action exhaustively. Although existing GCN-based methods try to fuse the skeleton and RGB modalities to represent spatiotemporal features, how to exploit the semantic consistency and complementarity between different modalities remains a question worth considering.

To tackle these problems, we have carried out systematic research and provide corresponding solutions. The main research and contributions of this dissertation are summarized as follows:

1. A Knowledge-embedded Graph Convolutional Network (K-GCN) is proposed for two-person interaction recognition. In this method, the interactive pattern between the two persons is treated as the discriminative feature for two-person interaction recognition, and two knowledge graphs are designed by exploiting human knowledge. A knowledge-given graph is constructed to build direct connections between the two persons, while a knowledge-learned graph builds adaptive correlations that are unique to each input sample. Moreover, K-GCN exploits the complementarity among the knowledge-given, knowledge-learned, and naturally connected graphs for two-person interaction recognition.

2. To capture complex temporal dynamics, a Temporal Enhanced Graph Convolutional Network (TE-GCN) is proposed, which constructs a graph structure along the temporal dimension. Specifically, the constructed temporal relation graph directly captures the temporal dynamics between both adjacent and non-adjacent time steps. Meanwhile, to explore the temporal dynamics more fully, a multi-head mechanism is designed to investigate multiple kinds of temporal relations. The discriminative dynamic semantic features of the skeleton sequence can thus be better extracted.

3. To take full advantage of the complementarity of different modalities, a Skeleton-Guided Multi-modal Network (SGM-Net) is proposed. The network employs a designed guided block to fuse the RGB and skeleton modalities at the feature level. Specifically, the guided block exploits the action feature extracted from the skeleton modality to guide the learning of the RGB feature, which makes the network focus on the RGB information strongly related to the action. Higher action recognition accuracy is thereby obtained.
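The core idea behind contribution 1 can be illustrated with a minimal sketch: a graph-convolution step that aggregates joint features over several adjacency graphs at once, e.g. the natural skeleton graph plus a knowledge-given inter-person graph. The shapes, the toy adjacency matrices, and the identity weight matrix below are illustrative assumptions, not the dissertation's actual K-GCN implementation (which also includes a learned adaptive graph).

```python
def matmul(a, b):
    """Plain dense matrix multiply for small lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def graph_conv(x, graphs, weight):
    """x: joints x features; graphs: list of joints x joints adjacencies.
    Output = sum over graphs of (A_g @ x) @ W, i.e. each joint aggregates
    features from its neighbors under every graph simultaneously."""
    out = None
    for a in graphs:
        term = matmul(matmul(a, x), weight)
        out = term if out is None else add(out, term)
    return out

# Toy two-person setup: joints 0-1 belong to person A, joints 2-3 to person B.
A_nat = [[0, 1, 0, 0],   # natural graph: edges within each person
         [1, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 0]]
A_know = [[0, 0, 1, 0],  # knowledge-given graph: edges across the two persons
          [0, 0, 0, 1],
          [1, 0, 0, 0],
          [0, 1, 0, 0]]
```

With an identity weight matrix, `graph_conv(x, [A_nat, A_know], W)` gives each joint the sum of its within-person and cross-person neighbor features, which is the complementarity the K-GCN is designed to exploit.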
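Contribution 2 replaces stacks of 1D local convolutions with a relation graph over time steps. A minimal sketch, assuming a softmax-normalized dot-product similarity as the relation measure (the dissertation's exact formulation may differ), shows how every frame becomes directly connected to both adjacent and non-adjacent frames, and how a multi-head variant uses per-head projections to capture different kinds of temporal relations:

```python
import math

def temporal_relation_graph(seq):
    """seq: T x F per-frame features. Returns a T x T relation graph:
    a row-softmax over pairwise dot-product similarities, so each frame
    attends directly to every other time step, near or far."""
    T = len(seq)
    dots = [[sum(a * b for a, b in zip(seq[i], seq[j])) for j in range(T)]
            for i in range(T)]
    graph = []
    for row in dots:
        m = max(row)                      # subtract max for numerical stability
        exps = [math.exp(v - m) for v in row]
        z = sum(exps)
        graph.append([e / z for e in exps])
    return graph

def multi_head_relations(seq, projections):
    """Multi-head sketch: each head projects frame features with its own
    weight matrix (hand-fixed here; learned in practice) before building
    its relation graph, so heads model different temporal relations."""
    def project(frame, w):
        return [sum(f * w[i][j] for i, f in enumerate(frame))
                for j in range(len(w[0]))]
    return [temporal_relation_graph([project(f, w) for f in seq])
            for w in projections]
```

In the toy sequence below, frames 0 and 2 share the same feature, so the relation graph links them directly and more strongly than the adjacent but dissimilar frame 1, which is exactly what stacked local convolutions cannot do in a single layer.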
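For contribution 3, the guided block can be sketched as a gating operation: the skeleton feature produces a per-channel sigmoid gate that re-weights the RGB feature before fusion. This is a hedged illustration of the guidance idea only; the function name, the sigmoid gate, and additive fusion are assumptions, not the SGM-Net's published architecture.

```python
import math

def skeleton_guided_fusion(rgb_feat, skel_feat):
    """Guided-block sketch: turn the skeleton feature into a sigmoid gate,
    use it to emphasize the RGB channels strongly related to the action,
    then fuse the gated RGB feature with the skeleton feature by addition."""
    gate = [1.0 / (1.0 + math.exp(-s)) for s in skel_feat]
    gated = [g * r for g, r in zip(gate, rgb_feat)]
    return [gd + s for gd, s in zip(gated, skel_feat)]
```

A zero skeleton feature yields a neutral gate of 0.5 everywhere, so the output is simply half the RGB feature; a strongly positive skeleton channel pushes its gate toward 1 and passes the corresponding RGB channel through almost unchanged.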
Keywords/Search Tags:Graph Convolutional Network, Human action recognition, Deep learning, Multi-modality, Knowledge embedded