
Research On Image Caption Algorithm Based On Visual Relationship

Posted on: 2021-10-29  Degree: Master  Type: Thesis
Country: China  Candidate: L Zhang  Full Text: PDF
GTID: 2518306047985919  Subject: Master of Engineering
Abstract/Summary:
In recent years, with the great achievements of deep learning in computer vision and natural language processing, it has become possible to use neural networks to describe complex visual concepts. Traditional image caption methods rely heavily on hard-coded visual concepts or fixed description templates, and struggle to generate diverse captions. In most deep-learning-based image caption methods, a convolutional network directly encodes the image into a single feature vector, which is then fed into a recurrent neural network to generate the text description. However, these methods neither fully mine the semantic information in the image nor consider the structured information between different image regions. As a result, most image caption models exhibit limited image understanding and poor scalability. This paper makes the following improvements to address these problems throughout the image caption pipeline.

First, the regional visual features are organized as a regional structured graph to enrich the representation of information in the image. The dependencies between vertexes in the graph are decomposed into conditional probabilities with the aid of statistics over the visual relationship triplets in the dataset, and these probabilities assign weights to the edges between the vertexes of the regional structured graph. A graph neural network is then used to learn graph embedding features for the visual regions of the image.

Furthermore, a novel visual relationship detection model based on graph neural networks is proposed by combining the semantic labels and location information of image regions. Experimental results show that this method achieves a large performance improvement on the large-scale Visual Genome dataset in the predicate detection task, and also achieves competitive results in visual relationship detection.

Since image knowledge in similar scenes is universal across task datasets, the proposed visual relationship detection model is used to extract the visual relationships of an image so that image knowledge can be shared. Because visual relationship triplets cannot be used directly in the image caption process, a semantic relationship graph is used to represent the triplets contained in the image. A Transformer serves as the backbone of the image caption model to fuse visual and semantic features: for the regional visual features, a multi-head attention mechanism attends to the features of different image regions; for the semantic relationship graph, a graph neural network encodes the graph into a semantic feature embedding matrix, and a double-layer attention mechanism provides guiding semantic information for the caption model. Experiments show that the caption model using both visual features and the semantic relationship graph achieves good performance compared with mainstream models.

In summary, the proposed image caption method based on visual relationships can fully mine the structured information of the image, alleviating the semantic gap between visual and textual information to a certain extent. In addition, it extends to related tasks such as scene graph generation and visual relationship detection.
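To make the graph-construction step described above concrete, here is a minimal sketch of how edge weights could be derived from triplet statistics. The triplet list, region labels, and the simple count-ratio estimate of P(object | subject) are illustrative assumptions, not the thesis's exact decomposition.

```python
from collections import Counter

# Hypothetical (subject, predicate, object) triplet statistics, standing in
# for counts gathered over a dataset such as Visual Genome.
triplets = [
    ("person", "riding", "horse"),
    ("person", "riding", "bike"),
    ("person", "near", "horse"),
    ("dog", "near", "horse"),
]

pair_counts = Counter((s, o) for s, _, o in triplets)
subj_counts = Counter(s for s, _, _ in triplets)

def edge_weight(subject, obj):
    """Estimate P(object | subject) from triplet co-occurrence counts.

    A minimal stand-in for the conditional-probability decomposition
    used to weight edges of the regional structured graph.
    """
    if subj_counts[subject] == 0:
        return 0.0
    return pair_counts[(subject, obj)] / subj_counts[subject]

# Weighted edges between detected region labels in one image.
regions = ["person", "horse", "dog"]
edges = {(a, b): edge_weight(a, b) for a in regions for b in regions if a != b}
print(edges)
```

In the full model, such weights would populate the adjacency matrix consumed by the graph neural network, as in the next sketch.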
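The graph neural network over the weighted regional graph could, under simple assumptions, look like the following one-layer message-passing sketch in PyTorch. The layer name RegionGraphLayer, the 256-dimensional features, and the sum-then-ReLU update are hypothetical choices for illustration, not the thesis's architecture.

```python
import torch
import torch.nn as nn

class RegionGraphLayer(nn.Module):
    """One round of message passing over the weighted regional graph.

    Each region feature is updated with a weighted sum of its
    neighbours' features, where the weights come from the
    conditional-probability edges sketched above.
    """
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (num_regions, dim) region visual features
        # adj:   (num_regions, num_regions) edge-weight matrix
        messages = adj @ feats            # aggregate neighbour features
        return torch.relu(self.linear(feats + messages))

feats = torch.randn(3, 256)               # 3 regions, 256-d features
adj = torch.rand(3, 3).fill_diagonal_(0)  # hypothetical edge weights
layer = RegionGraphLayer(256)
embedded = layer(feats, adj)
print(embedded.shape)  # torch.Size([3, 256])
```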
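For the Transformer-based fusion of regional visual features and the semantic relationship graph, the following sketch wires up two multi-head attention blocks: one attending over region features and one over the semantic graph embedding matrix. The tensor shapes, module names, and fusing the two contexts by addition are assumptions; the thesis's double-layer attention mechanism may differ in its exact wiring.

```python
import torch
import torch.nn as nn

dim, heads = 256, 8
visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
semantic_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

words = torch.randn(1, 12, dim)      # partial caption states (query)
regions = torch.randn(1, 36, dim)    # regional visual features
sem_graph = torch.randn(1, 20, dim)  # semantic relationship-graph embeddings

v_ctx, _ = visual_attn(words, regions, regions)        # attend to image regions
s_ctx, _ = semantic_attn(words, sem_graph, sem_graph)  # attend to semantic graph
fused = v_ctx + s_ctx                                  # fused guidance for the decoder
print(fused.shape)  # torch.Size([1, 12, 256])
```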
Keywords/Search Tags: Visual Relationship, Regional Feature Graph, Graph Neural Network, Semantic Relationship Graph, Image Caption