
Hierarchical Spatial-Temporal Semantic Primitives Based Human-Object Interaction Recognition In Videos

Posted on: 2023-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
GTID: 2558306908967409
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
As society becomes increasingly technological and intelligent, computer vision has seen an unprecedented boom. To understand the visual world more deeply, researchers are no longer limited to simple scene recognition tasks such as object detection and are turning to the understanding of visual relationships in complex scenes. Among these relationships, human-object interaction best represents the theme of a scene. This thesis focuses on human-object interaction recognition in video, which, compared with traditional image-based human-object interaction detection, shifts the focus from spatial relationships to spatial-temporal relationships. Most current video human-object interaction recognition algorithms use instance-level spatial primitives and frame-level temporal primitives. These lack detailed information about the interaction and ignore the hierarchical structure and composability of primitives, which makes good results difficult to achieve. This thesis therefore proposes a video human-object interaction recognition approach based on hierarchical spatial-temporal semantic primitives. It uses fine-grained spatial-temporal primitives to provide detailed information and combines primitives to construct more descriptive spatial-temporal features, thereby improving recognition accuracy. The main work consists of the following two parts.

First, a human-object interaction recognition method based on hierarchical spatial-temporal semantic primitives is proposed to model human-object interaction in space and in time separately. In the spatial dimension, human skeleton keypoint information is introduced, and keypoints are combined into body part primitives according to the hierarchical structure of the human body. An object-part graph and a part-part graph are then constructed: the object-part graph captures the interaction features between body parts and objects, and the part-part graph represents the human pose during the interaction. The spatial human-object interaction representations are learned by graph convolutional networks. In the temporal dimension, the spatial features of video frames are combined into time segment primitives and associated with a set of learnable implicit primitives, resulting in a richer semantic representation. The proposed method outperforms state-of-the-art methods by 0.3 F1 score on CAD-120, which demonstrates its effectiveness.

Second, for long-term temporal modeling in human-object interaction recognition, a method based on a multi-scale temporal relational fusion module is proposed. Existing 3D convolutional neural networks and recurrent neural networks are too deep to optimize easily and lack long-distance temporal dependence. To address these problems, the proposed method fuses features at different temporal scales to capture both local and global information. Constructing multiple temporal graphs achieves temporal correlation modeling while avoiding the difficulty of optimizing and training on dense graphs. Experiments show that multi-scale temporal features enable the network to account for both local continuous temporal dependence and long-distance temporal dependence at the same time. The proposed method outperforms state-of-the-art solutions by 1.6 mAP on VidHOI Long.

In summary, this thesis studies video human-object interaction recognition based on hierarchical spatial-temporal semantic primitives. For different scenes and data, different spatial-temporal modeling methods are selected to design corresponding interaction recognition networks. This improves the spatial-temporal representation of interaction features and yields more accurate recognition in different scenes, giving the work both theoretical research value and practical application value.
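The abstract does not specify how the multiple temporal graphs are built or fused, so the following is only a minimal sketch of the general idea under stated assumptions: each scale is a sparse graph linking frame t to frame t + stride, neighbours are mean-aggregated, and scales are fused by a plain average. The function names, the stride values, and the averaging are all hypothetical stand-ins; the thesis itself presumably uses learned graph convolutions and a learned fusion.

```python
def temporal_graphs(num_frames, strides=(1, 2, 4)):
    """Build one sparse edge list per temporal scale (assumed: fixed strides)."""
    # Each graph links frame t to frame t + stride, so small strides capture
    # local continuity and large strides capture long-distance dependence.
    return {s: [(t, t + s) for t in range(num_frames - s)] for s in strides}


def aggregate(features, edges):
    """Mean-aggregate every frame with its graph neighbours (self-loop included)."""
    n, dim = len(features), len(features[0])
    neighbours = {i: [i] for i in range(n)}
    for a, b in edges:  # treat edges as undirected
        neighbours[a].append(b)
        neighbours[b].append(a)
    return [
        [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
         for d in range(dim)]
        for i in range(n)
    ]


def multi_scale_fuse(features, strides=(1, 2, 4)):
    """Aggregate on each temporal graph, then average the scales.

    The plain average is a placeholder for a learned fusion module.
    """
    graphs = temporal_graphs(len(features), strides)
    per_scale = [aggregate(features, graphs[s]) for s in strides]
    n, dim = len(features), len(features[0])
    return [
        [sum(p[i][d] for p in per_scale) / len(per_scale) for d in range(dim)]
        for i in range(n)
    ]


if __name__ == "__main__":
    frames = [[float(t)] for t in range(8)]  # toy one-dimensional frame features
    graphs = temporal_graphs(len(frames))
    print({s: len(e) for s, e in graphs.items()})
    print(multi_scale_fuse(frames))
```

For 8 frames, the three sparse graphs in this sketch hold 7, 6, and 4 edges, against 28 edges for a fully connected temporal graph, which illustrates why several sparse graphs can be easier to train on than one dense graph.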
Keywords/Search Tags:Human-object interaction, Hierarchical spatial-temporal semantic primitives, Primitive combination, Multi-scale fusion