With the prevalence of social networking software and short-video applications, multimedia data is growing explosively, and inferring the content of massive video and image collections has become an active research area. This article studies two problems: human-object interaction detection in natural scenes and skeleton-based action recognition. It proposes algorithms for activity understanding and visual interaction that improve the accuracy of multimedia content understanding. The main research contents of this article are as follows:

1. Two graph parsing algorithms for Human-Object Interaction Detection are proposed. First, the graph structure is designed to highlight the role of the human as the subject of a Human-Object Interaction, and an attention mechanism together with human keypoint information is encoded into the node representations. In addition, a novel object node representation that exploits interaction-related features is proposed to improve the accuracy of Human-Object Interaction Detection.

2. A novel relative pose representation is proposed for human pose sequence recognition, enriching the feature information available for encoding pose sequences. A purpose-built two-pathway network learns both relative and absolute pose representations to achieve better performance on pose sequence recognition.

3. The network architecture of Spatial Temporal Graph Convolutional Networks is improved. On the one hand, an action-adaptive adjacency matrix is obtained by learning the relative positional movement between different keypoints. On the other hand, integrating the body attributes of human keypoints into the partitioning used for feature transformation helps improve action recognition accuracy.
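
To make the idea of a relative pose representation concrete, the following is a minimal sketch and not the formulation used in this article. It assumes, purely for illustration, that a relative representation can be built from pairwise offsets between 2D keypoints; the function names `relative_pose` and `translate` and the toy pose are hypothetical. The sketch shows one property that distinguishes such a representation from absolute coordinates: it does not change when the whole skeleton is shifted in the image.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def relative_pose(keypoints: List[Point]) -> List[List[Point]]:
    """Pairwise offsets: result[i][j] is keypoint j minus keypoint i.

    Assumption for illustration only: the article's relative pose
    representation may be defined differently.
    """
    return [
        [(xj - xi, yj - yi) for (xj, yj) in keypoints]
        for (xi, yi) in keypoints
    ]


def translate(keypoints: List[Point], dx: float, dy: float) -> List[Point]:
    """Shift every keypoint by the same offset (dx, dy)."""
    return [(x + dx, y + dy) for (x, y) in keypoints]


# Toy pose with three keypoints in absolute image coordinates (hypothetical data).
pose = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]

rel = relative_pose(pose)
shifted_rel = relative_pose(translate(pose, 5.0, -2.0))

# Absolute coordinates change under translation; pairwise offsets do not.
print(rel == shifted_rel)
```

In a two-pathway design such as the one described above, the absolute coordinates and a relative encoding of this kind would be fed to separate branches; how the branches are built and fused is not specified here.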