Research On Fine-grained Image Caption

Posted on: 2022-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Yun
Full Text: PDF
GTID: 1488306509458304
Subject: Computer application technology
Abstract/Summary:
The main task of image captioning is to automatically describe the content of an image in natural language. It spans two research fields, computer vision and natural language processing, and is an important research topic in the artificial intelligence community. Image captioning has made great progress, motivated by valuable practical applications such as image retrieval, semantic visual search, multi-modal retrieval, visual intelligence in dialogue robots, assistance for people with visual impairments, military reconnaissance, street monitoring, and safety feedback. This dissertation studies fine-grained image captioning, so that a caption not only gives a general description of the objects in the image but also describes more precisely the attributes of those objects and the relationships between them. This makes the captioning task complex and challenging, mainly in the following respects:

1) Existing methods usually generate only one or a few short sentences for an image and generally focus on the salient objects, ignoring an accurate description of their attributes. Furthermore, the same attribute can describe different types of objects, so it is not easy to match attributes to the corresponding objects.
2) Only the visual relationships between objects in the image are used to model the interaction between objects. However, the number of visual relationships in an image is huge, while the number of annotated relationships is relatively small. When an unannotated visual scene is encountered, the generated caption mentions only some salient objects.
3) Existing methods describe the objects in the image only in general terms and lack specific information such as the names of real-world people and places.

How to obtain a fine-grained image description, how to correctly match the attributes and relationships of objects, and how to obtain the names of objects are all very challenging problems. To address them, this dissertation starts from the perspective of fine-grained semantic learning and tackles three key problems: (1) accurately matching objects and their attributes; (2) mining objects, object attributes, and the relationships between objects; (3) introducing named entities for objects in the image. The main research work is summarized as follows:

(1) We propose a gated object-attribute matching network for image captioning. The method uses a visual attention mechanism based on a long short-term memory (LSTM) neural network to locate objects in the image, and a semantic attention mechanism based on LSTM to obtain attribute labels for the image. Objects and attributes are matched through joint learning of gating units. The gated object-attribute matching network gives more attribute details when describing objects, which makes the generated captions more precise.

(2) We propose an image captioning method with scene graph alignment. The method studies the alignment between image and caption at a fine-grained level. We generate scene graphs for both the image and the caption, using objects, attributes, and relations as nodes, and reconstruct them by aligning the image scene graph with the caption scene graph. The method consists of an image scene graph generator, a sentence scene graph generator, a feature mapper, and a caption generator. It can further generate the attributes and relations of objects and then produce a caption containing more detailed semantic information. Experimental evaluation shows the effectiveness of the proposed method.

(3) We propose a context-driven, named-entity-aware image captioning method. To generate captions that contain specific information such as the names of people and places, the method integrates Internet news into image captioning as background knowledge, yielding a context-driven, name-aware image description method. Guided by the image content, the method extracts named entities from the news, analyzes their semantic relevance using a knowledge graph, an entity linking algorithm, and a quantitative set verification algorithm, and finds the relations between entities from a global perspective, in order to obtain captions with more specific information.
Keywords/Search Tags:Image Caption, Attention Mechanism, Context driven, Scene Graph
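
The gating idea in contribution (1) can be illustrated with a small sketch. The abstract does not give the gate equations, so what follows is only a generic gated fusion under stated assumptions, not the dissertation's actual network: an object feature and an attribute feature of equal dimension are blended element-wise by a sigmoid gate computed from both. The names `gated_fusion`, `w_obj`, `w_attr`, and `bias` are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_fusion(obj_feat, attr_feat, w_obj, w_attr, bias):
    """Blend an object feature with an attribute feature (illustrative only).

    gate  = sigmoid(W_obj @ obj + W_attr @ attr + b)
    fused = gate * obj + (1 - gate) * attr      (element-wise)

    This is a common gating pattern, assumed here for illustration;
    the dissertation's exact formulation is not stated in the abstract.
    """
    pre_activation = [
        o + a + b
        for o, a, b in zip(matvec(w_obj, obj_feat), matvec(w_attr, attr_feat), bias)
    ]
    gate = [sigmoid(p) for p in pre_activation]
    fused = [g * o + (1.0 - g) * a for g, o, a in zip(gate, obj_feat, attr_feat)]
    return fused, gate


# Toy usage with random (untrained) weights; in practice the weights
# would be learned jointly with the rest of the captioning model.
rng = random.Random(0)
dim = 4
w_obj = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
w_attr = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
bias = [0.0] * dim
obj_feat = [rng.uniform(-1.0, 1.0) for _ in range(dim)]   # e.g. attended visual feature
attr_feat = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # e.g. attended attribute embedding

fused, gate = gated_fusion(obj_feat, attr_feat, w_obj, w_attr, bias)
print("gate :", [round(g, 3) for g in gate])
print("fused:", [round(f, 3) for f in fused])
```

Because each gate value lies strictly between 0 and 1, every fused component is a convex combination of the corresponding object and attribute components, so the gate decides per dimension how much attribute information is mixed into the object representation.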