Visual Relationship Generation Based On Scene Understanding

Posted on: 2022-11-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Y Guo
GTID: 1488306764460124
Subject: Computer Science and Technology
Abstract/Summary:
Computer vision is an essential field of artificial intelligence, and the rapid growth of technology in this field is critical to the national economy and people's livelihood. The basic mission of computer vision is to let the computer imitate the human visual system and understand the visual content of digital images or videos. Some visual cognitive tasks, such as image classification and instance detection, have already made great progress. However, recognizing and detecting instances in isolation cannot fully capture the visual content of an image scene. The rich relationships between instances are also important for understanding the semantic content of images. Therefore, this dissertation focuses on how to explore the relationship information in visual signals and express it in a structured way.

Specifically, this dissertation uses a graph structure to represent the relationship information, and the task it addresses is scene graph generation (SGG), which aims to understand the scene content completely. The graph is composed of nodes and edges. Nodes represent instances in the image, including their bounding boxes and categories. An edge connects a subject node to an object node and represents the visual relationship from the subject to the object, which is expressed by a predicate. The scene graph can also be decomposed into a series of relationship triplets from subjects to objects, that is, <subject, predicate, object>, such as <man, ride, horse>. In summary, this dissertation explores visual relationship generation based on scene understanding.

Since current image recognition and detection networks have achieved excellent results, the deep features extracted from these networks can fully characterize the instance information in the image. The difficulty of extracting relationship information therefore lies in how to fully utilize and supplement these deep features, so that the model can infer accurate and specific relationships under various data conditions. Consequently, this dissertation summarizes four key problems of visual relationship generation: the lack of context in local CNN features, insufficient knowledge when only a few samples are available, the low information content of predictions learned from long-tail data, and the semantic confusion of related predicates. To solve these problems, this dissertation proposes a relation regularized network, multiple structured knowledge, a balanced predicate learning strategy, and a semantic debiasing module, respectively. The main conclusions are as follows:

(1) How to extract effective global and relational context information from the local features of convolutional networks to assist relationship prediction? This dissertation designs a relation regularized network that captures context information to assist scene graph generation. Relationships are not predicted in isolation: they depend heavily on visual context, that is, on information in the surrounding environment. The local features extracted from a convolutional network do not contain this context. To solve this problem, two kinds of context information are first identified, namely relational context and global context. Graph convolutional networks and a bidirectional long short-term memory network are then used to extract these two kinds of context from instance features to assist scene graph generation.

(2) How to predict relationships well with only a few samples? This dissertation designs a multiple structured knowledge network to compensate for the lack of knowledge in scene graph generation with a few samples. Human beings can learn rich relationship information from a few samples, whereas current scene graph models depend on a large number of samples. To simulate the human learning style, this dissertation first proposes a one-shot scene graph generation task. Then, to compensate for the lack of knowledge in this task, it defines relational knowledge from the Visual Genome dataset and commonsense knowledge from the ConceptNet dataset. To extract knowledge features, the two kinds of knowledge are organized into a graph structure, and a graph convolutional network encodes this multiple structured knowledge and generates knowledge features that assist one-shot scene graph generation.

(3) How to improve the information content of the relationships generated from long-tail data? This dissertation designs a balanced predicate learning strategy to increase the information content of generated scene graphs. Current scene graph models are trapped in common predicates that carry little information and cannot adequately predict informative predicates. This not only degrades the overall performance of current models but also hinders the application of generated scene graphs to downstream tasks. This dissertation argues that the problem is mainly caused by the long-tail distribution of predicate samples in the training data. Consequently, it proposes a scene graph generation framework based on balanced predicate learning, which uses a random undersampling strategy and an ambiguity removing strategy to improve the information content of the results generated by scene graph models.

(4) How to alleviate the semantic confusion of related predicates? This dissertation designs a semantic debiasing module that revises the prediction results and makes them more specific. Since relationship predicates are often semantically related, scene graph models easily confuse predicates whose meanings overlap. Consequently, this dissertation constructs a relation matrix of predicates and uses it in training and inference to alleviate semantic confusion. Specifically, the semantic relation of predicates is constructed in two ways: from the confusion matrix of a baseline model, and from a bipartite graph based on subject-object overlap. This predicate relation is then applied to the predicate distribution generated by the model to alleviate the semantic ambiguity of the generated scene graph.

Finally, this dissertation summarizes the above research and looks ahead to potential research directions that may have an important impact on the development of visual relationship generation.
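
As a rough illustration of two ideas named in the abstract, the sketch below represents scene graph edges as <subject, predicate, object> triplets and applies a simple random undersampling step that caps the number of samples per predicate. It is not the dissertation's implementation: the class names, the bounding-box layout, the `max_per_predicate` cap, and the function `undersample_by_predicate` are all hypothetical, and the ambiguity removing strategy is not covered because the abstract does not describe it in detail.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """A graph node: an instance category plus its bounding box."""
    category: str
    box: tuple  # assumed layout: (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Relationship:
    """A directed edge from a subject node to an object node."""
    subject: Instance
    predicate: str
    obj: Instance

    def triplet(self):
        # Decompose the edge into a <subject, predicate, object> triplet.
        return (self.subject.category, self.predicate, self.obj.category)


def undersample_by_predicate(relationships, max_per_predicate, seed=0):
    """Randomly keep at most `max_per_predicate` samples of each predicate.

    Hypothetical helper: frequent (head) predicates are cut down to the cap,
    while rare (tail) predicates are kept in full.
    """
    rng = random.Random(seed)
    by_predicate = defaultdict(list)
    for rel in relationships:
        by_predicate[rel.predicate].append(rel)

    balanced = []
    for rels in by_predicate.values():
        if len(rels) > max_per_predicate:
            rels = rng.sample(rels, max_per_predicate)
        balanced.extend(rels)
    return balanced


man = Instance("man", (10, 20, 110, 220))
horse = Instance("horse", (40, 90, 260, 300))
print(Relationship(man, "ride", horse).triplet())  # ('man', 'ride', 'horse')
```

In this sketch, a training set dominated by a common predicate such as "on" would be reduced to the cap for that predicate, so that an informative predicate such as "ride" makes up a larger share of the remaining samples.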
Keywords/Search Tags: Visual Relationship, Scene Graph Generation, Structured Knowledge, Balanced Learning, Semantic Debiasing