
Fusion And Reasoning Of Video Visual Relation Detection Based On Graph Neural Network

Posted on: 2022-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Nie
Full Text: PDF
GTID: 2518306572450964
Subject: Software engineering
Abstract/Summary:
With the arrival of the big-data era and the continued maturation of deep learning, basic computer vision tasks have made great progress, even exceeding human performance on image classification, object detection, and similar tasks. In complex scenes, however, deep learning models still face many challenges, and how to recognize multiple objects and understand the relationships between them remains a problem worth studying. Video visual relation detection (VidVRD) is an important task for this kind of high-level semantic understanding. Compared with general object detection, VidVRD requires predicting not only the category and trajectory of each object but also the relationships between objects, expressed as relation triplets <subject, predicate, object>, such as <person, ride, horse>. With these triplets, a model can better understand the visual scene, which has great application value in intelligent robotics, autonomous driving, and other areas. Existing methods, however, still face challenges in VidVRD: they must identify both long- and short-term visual relations among the many objects in a video, and they must fuse similar, overlapping visual relations detected in different video frames. To address these problems, this thesis proposes a fusion-and-reasoning model for video visual relations based on graph neural networks, described as follows.

First, a graph neural network is used to capture contextual information in time and space and to predict short-term object relations. A spatio-temporal graph containing the coordinate information of the objects in adjacent video clips is constructed; then an improved GCTrans model, based on the graph neural network, aggregates this context information and models the visual relations, yielding the relations among the multiple objects in each video clip.

Second, a multi-hypothesis tree is used to fuse the extracted short-term visual relations into long-term ones. Based on multiple-hypothesis fusion (MHF), this thesis constructs a multi-hypothesis tree that preserves all possible correspondences, where each node represents a relation fragment observed in a video clip. As MHF processes the clips sequentially, each observed relation fragment is selectively attached to the corresponding tree as a leaf node, updating an existing hypothesis or creating a new one. Each path from the root node to a leaf node then represents a complete relation instance.

Finally, extensive experiments on the video visual relation detection benchmark ImageNet-VidVRD show that the model yields significant improvements. Stacking several graph-neural-network-based GCTrans modules significantly improves the prediction of visual relations, and the proposed multi-hypothesis fusion (MHF) module effectively fuses the extracted short-term relations, enabling the prediction of more complex, video-level long-term relations. Both stages, relation prediction and relation fusion, are significantly improved and achieve state-of-the-art performance.
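The first stage, predicting short-term relations by aggregating context over a spatio-temporal object graph, can be illustrated schematically. The thesis does not spell out the GCTrans internals here, so the following is only a minimal graph-convolution sketch in NumPy; the function names, the weight shapes, and the pairwise predicate-scoring head are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize A + I (the standard GCN propagation matrix)."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(h, adj_norm, w):
    """One graph-convolution layer: aggregate neighbor features, project, ReLU."""
    return np.maximum(adj_norm @ h @ w, 0.0)

def predicate_scores(h, w_pair):
    """Score every ordered (subject, object) pair of nodes.

    Concatenates the two node embeddings and applies a linear head,
    giving one logit vector over predicate classes per pair.
    """
    n = h.shape[0]
    scores = {}
    for s in range(n):
        for o in range(n):
            if s != o:
                pair = np.concatenate([h[s], h[o]])
                scores[(s, o)] = pair @ w_pair
    return scores
```

In this sketch the adjacency matrix would encode spatial overlap or proximity of object trajectories within a clip, so that message passing lets each object embedding absorb the context of nearby objects before pairwise predicates are scored.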
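The second stage, fusing short-term relation fragments into video-level instances with a multi-hypothesis tree, can likewise be sketched. The version below simplifies the tree to a list of root-to-leaf paths and uses a purely label-and-gap compatibility test; the `Fragment` fields, the `max_gap` parameter, and the greedy association rule are assumptions for illustration, not the MHF module as implemented in the thesis.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """A short-term relation observed in one video clip."""
    triplet: tuple  # (subject, predicate, object), e.g. ("person", "ride", "horse")
    start: int      # first frame index of the clip
    end: int        # last frame index of the clip

class HypothesisTree:
    """Keeps every plausible linkage of fragments.

    Each stored path plays the role of one root-to-leaf path in the
    multi-hypothesis tree, i.e. one candidate long-term relation instance.
    """
    def __init__(self, max_gap=1):
        self.paths = []          # each path is a chronological list of fragments
        self.max_gap = max_gap   # largest allowed frame gap between fragments

    def _compatible(self, last, frag):
        """Same triplet label and temporally adjacent (within max_gap)."""
        return (last.triplet == frag.triplet
                and 0 <= frag.start - last.end <= self.max_gap)

    def observe(self, frag):
        """Attach the fragment to every compatible hypothesis, else start a new one."""
        extended = False
        for path in self.paths:
            if self._compatible(path[-1], frag):
                path.append(frag)          # grow this hypothesis with a new leaf
                extended = True
        if not extended:
            self.paths.append([frag])      # new root: a fresh hypothesis

    def instances(self):
        """Merge each surviving path into one video-level relation instance."""
        return [(p[0].triplet, p[0].start, p[-1].end) for p in self.paths]
```

Processing clips in temporal order, two "person rides horse" fragments in consecutive clips merge into one instance spanning both clips, while a fragment separated by a larger gap starts a new instance.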
Keywords/Search Tags:deep learning, video visual relation detection, graph neural network