
Hierarchical Visual Semantic Embedding For Image Captioning

Posted on: 2020-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Shen
Full Text: PDF
GTID: 2428330575464630
Subject: Computer technology

Abstract/Summary:
With the development of computer vision, machines are no longer limited to tasks such as detection, recognition, and segmentation; they can also automatically describe the objective content of an image, a task known as image captioning. Unlike image classification or object detection, image captioning aims to produce a holistic natural-language description of the important scenes and objects in an image and the relationships among them, and it is an important part of visual content understanding.

Mainstream image captioning methods rely on an encoder-decoder structure, which encodes an image with a pre-trained convolutional neural network and then decodes a sentence with a recurrent neural network. However, because only an abstract visual feature is extracted and embedded, such methods cannot explicitly represent the hierarchical visual semantics of images. Other methods generate captions from detected visual concepts, but the detected concepts are object-centered: these methods cannot express hierarchical visual semantics well and do not take scene semantics into consideration. Scene-semantic context captures higher-level information encoded in the picture, such as the location at which the picture was taken and the possible activities of the people in it, and the words of the generated caption vary with the scene type. In this work, the scene context is extracted from the image as a visual semantic and used to influence the attention module and text generation.

To address the above problems in current image captioning methods, this thesis proposes an image captioning method based on hierarchical visual semantic embedding. The main innovations are: 1) For the first time, hierarchical visual semantic information, namely scene semantics and object-centered semantics, is considered, and the relationship among scene semantics, object-centered semantics, and caption generation is modeled. The scene semantics provide scene context for the attention module, guiding attention to the object-centered semantics and the generation of the caption. 2) A scene-based factored attention module is proposed to guide visual information embedding and caption generation at different visual levels. It embeds three different levels of visual information (image regional features, object-centered visual concepts, and scene semantics) in different ways, effectively representing visual semantic information at each level.

Experiments are carried out on a popular captioning dataset. Compared with current mainstream methods under the same conditions, the proposed method substantially outperforms other image captioning approaches on various metrics.
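The encoder-decoder pipeline the abstract builds on can be illustrated with a minimal sketch. All weights below are random stand-ins and the tiny vocabulary is hypothetical; a real system would use a pre-trained CNN (e.g. a ResNet) as the encoder and a trained LSTM as the decoder.

```python
import numpy as np

# Minimal sketch of the encoder-decoder captioning loop: an image
# feature vector (here random, standing in for CNN output) initializes
# an RNN hidden state, which greedily emits words until "<end>".
rng = np.random.default_rng(0)
VOCAB = ["<start>", "<end>", "a", "dog", "runs", "park"]  # toy vocabulary
D_IMG, D_HID = 8, 16

image_feat = rng.normal(size=D_IMG)          # "encoder" output (placeholder)

W_ih = rng.normal(size=(D_HID, D_IMG))       # image -> initial hidden
W_hh = rng.normal(size=(D_HID, D_HID))       # hidden recurrence
W_eh = rng.normal(size=(D_HID, len(VOCAB)))  # word one-hot -> hidden
W_ho = rng.normal(size=(len(VOCAB), D_HID))  # hidden -> vocab logits

def one_hot(i):
    v = np.zeros(len(VOCAB))
    v[i] = 1.0
    return v

def decode(image_feat, max_len=10):
    h = np.tanh(W_ih @ image_feat)           # init hidden from image
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_eh @ one_hot(word))
        word = int(np.argmax(W_ho @ h))      # greedy word choice
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption

caption = decode(image_feat)
print(caption)
```

With untrained random weights the output is of course meaningless; the sketch only shows the control flow that the hierarchical-semantic method extends.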
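The scene-based factored attention described in innovation 2) can be sketched as scene-conditioned attention over region features: the scene vector and the decoder state jointly form the attention query, so scene context biases which regions are attended. The weight names, shapes, and the additive factoring are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

# Hedged sketch: a scene-semantic vector modulates attention over image
# region features before the attended context is fed to the decoder.
rng = np.random.default_rng(1)
D, N_REGIONS = 6, 4

regions = rng.normal(size=(N_REGIONS, D))  # region-level visual features
scene = rng.normal(size=D)                 # scene-semantic vector
hidden = rng.normal(size=D)                # decoder hidden state

W_s = rng.normal(size=(D, D))              # projects scene context
W_h = rng.normal(size=(D, D))              # projects language context

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scene_factored_attention(regions, scene, hidden):
    # Factor the query into a scene part and a language part, so the
    # scene context biases which regions the decoder attends to.
    query = W_s @ scene + W_h @ hidden
    scores = regions @ query
    alpha = softmax(scores)                # attention distribution
    context = alpha @ regions              # attended visual context
    return alpha, context

alpha, context = scene_factored_attention(regions, scene, hidden)
print(alpha)
```

The same attention call can be repeated per visual level (regions, object-centered concepts, scene semantics) and the resulting contexts combined, mirroring the three-level embedding the abstract describes.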
Keywords/Search Tags:Image Captioning, Hierarchical Visual Semantic Embedding, Scene-based Factored Attention, Scene understanding