Action recognition is an important and challenging task in computer vision. It is in great demand and has wide application value in video retrieval, security monitoring, virtual reality, and autonomous driving, and it has attracted extensive attention in both academia and industry. Human actions can be recognized from multimodal sources, such as RGB video and skeleton data. Compared with RGB, skeleton data are simpler and more robust to illumination and viewpoint changes, and have therefore attracted growing attention from researchers in recent years. Compared with the classic convolutional neural network (CNN), the graph convolutional network (GCN) has an inherent advantage in extracting features from graph-structured data: it can extract features while preserving the structural information of the skeleton graph. Great research progress has been made in skeleton-based action recognition, but existing work still has the following problems:

(1) Although GCN-based skeleton action recognition preserves the graph structure of the skeleton well during feature extraction, the traditional fully connected layer and softmax classifier are still used in the final classification stage. The graph-structure information maintained in the earlier stages is thus discarded, resulting in suboptimal recognition performance.

(2) Joint and bone information in skeleton data are both powerful cues for action recognition and are complementary to each other, but existing GCN-based methods use two independent streams to extract joint features and bone features separately. This clearly ignores the correlation between joint and bone information, again leading to suboptimal recognition performance.

(3) Most existing methods focus only on the skeleton modality. However, some action classes depend heavily on the objects people are interacting with, or on subtle local movements, which are not directly available from skeleton data. We therefore need to resort to the information provided by the RGB modality as a supplementary source.

To address these problems, this thesis conducts a series of studies based on the skeleton and RGB modalities. The main research contents and innovations are as follows:

(1) To address the first problem, a GCN-HCRF model is proposed. Different from existing GCN-based methods, the proposed GCN-HCRF combines a GCN with a hidden conditional random field (HCRF) for action recognition. First, the GCN extracts the spatio-temporal features of each joint; the HCRF then performs classification without destroying the skeleton graph structure, so the model makes full use of the structural information of the skeleton graph. In addition, to let the HCRF directly guide the GCN to extract more meaningful features, a message-passing strategy is adopted to enable end-to-end training.

(2) To address the second problem, this thesis proposes a VE-GCN model. Different from existing methods, which extract joint and bone features with two separate network streams and ignore the relationship between them, the proposed VE-GCN extracts the joint features, the bone features, and their relationship information in each convolution layer of a single stream. To further improve performance, two non-physical connection matrices are learned: a joint-joint and a joint-bone adjacency matrix. With the VE-GCN convolution operation, strongly related joint and bone information at any distance is thus aggregated to the central joint of the sampling area.

(3) To address the third problem, this thesis proposes a skeleton+RGB two-stream fusion network that makes full use of the complementary information from the two modalities, where the skeleton stream adopts a GCN to extract skeleton features and the RGB stream adopts a CNN to extract RGB features. In the fusion stage, different from the score-level fusion used by existing two-stream methods, which simply adds the scores of the two streams to obtain the final classification result, we propose to use discriminative canonical correlation analysis (DCCA) to fuse the skeleton and RGB features at the feature level, which yields better recognition performance than score fusion.

The proposed models are evaluated on two large-scale datasets, NTU RGB+D and NTU RGB+D 120, and two small-scale datasets, N-UCLA and SYSU. The experimental results validate the effectiveness of the proposed models compared with state-of-the-art methods.
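The graph-convolution operation underlying these models can be sketched roughly as follows. This is a minimal illustration only, not the thesis's exact formulation: the physical adjacency here is a random stand-in for a real skeleton graph, the learned "non-physical" adjacency is simply added to it before normalisation, and all sizes, symbols, and the initialisation scheme are illustrative assumptions.

```python
import numpy as np

# Hypothetical sizes: 25 joints (as in NTU RGB+D), 3-D coordinates, 16 output channels.
num_joints, in_channels, out_channels = 25, 3, 16

rng = np.random.default_rng(0)

# A_phys: physical skeleton adjacency (here a random symmetric 0/1 stand-in
# with self-loops; a real model would use the actual bone connections).
A_phys = (rng.random((num_joints, num_joints)) > 0.8).astype(float)
A_phys = np.maximum(A_phys, A_phys.T)
np.fill_diagonal(A_phys, 1.0)

# A_learn: a learnable "non-physical" adjacency, initialised near zero and
# trained jointly with the network so that distant but correlated joints
# can be aggregated directly.
A_learn = 0.01 * rng.standard_normal((num_joints, num_joints))


def gcn_layer(X, A_phys, A_learn, W):
    """One graph-convolution layer: H = ReLU(D^-1 (A_phys + A_learn) X W)."""
    A = A_phys + A_learn
    # Row-normalise so each joint averages over its (physical + learned) neighbours.
    D_inv = np.diag(1.0 / np.maximum(A.sum(axis=1), 1e-6))
    return np.maximum(D_inv @ A @ X @ W, 0.0)  # ReLU activation


X = rng.standard_normal((num_joints, in_channels))    # joint coordinates, one frame
W = rng.standard_normal((in_channels, out_channels))  # layer weights

H = gcn_layer(X, A_phys, A_learn, W)
print(H.shape)  # per-joint feature matrix, shape (25, 16)
```

Because the output keeps one feature vector per joint, the graph structure survives the layer, which is what allows a downstream graph-aware classifier (such as the HCRF) or a single-stream joint-and-bone design to exploit it.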