
Hierarchical Visual Semantic Embedding For Image Captioning

Posted on: 2020-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Shen
Full Text: PDF
GTID: 2428330575464630
Subject: Computer technology

Abstract/Summary:
With the development of computer vision, machines are no longer limited to tasks such as detection, recognition, and segmentation; they can also automatically describe the objective content of an image, a task known as image captioning. Unlike image classification or object detection, image captioning aims to produce a holistic natural-language description of the important scenes and objects in an image and the relationships among them, and it is an important part of visual content understanding.

Mainstream image captioning methods rely on an encoder-decoder structure, which encodes an image with a pre-trained convolutional neural network and then decodes a sentence with a recurrent neural network. However, because only an abstract visual feature is extracted and embedded, such methods cannot explicitly represent the hierarchical visual semantics of images. Other methods generate captions from detected visual concepts, but the detected concepts are object-centered: these methods cannot express hierarchical visual semantics well and do not take scene semantics into consideration. Scene-semantic context captures higher-level information encoded in the picture, such as the location at which the picture was taken and the possible activities of the people in it, and the words of the generated caption vary with the scene type. In this work, the scene context is extracted from the image as a visual semantic and used to influence the attention module and text generation.

To address the above problems in current image captioning methods, this thesis proposes an image captioning method based on hierarchical visual semantic embedding. The main innovations are: 1) For the first time, hierarchical visual semantic information, namely scene semantics and object-centered semantics, is considered, and the relationship among scene semantics, object-centered semantics, and caption generation is modeled. The scene semantics provide scene context for the attention module, guiding attention to the object-centered semantics and the generation of the caption. 2) A scene-based factored attention module is proposed to guide visual information embedding and caption generation at different visual levels. It embeds three different levels of visual information (image regional features, object-centered visual concepts, and scene semantics) in different ways, effectively representing visual semantic information at each level.

Experiments are carried out on a popular captioning dataset. Compared with current mainstream methods under the same conditions, the proposed method substantially outperforms other image captioning approaches on various metrics.
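The encoder-decoder pipeline the abstract builds on can be illustrated with a minimal sketch. All weights below are random stand-ins and the tiny vocabulary is hypothetical; a real system would use a pre-trained CNN (e.g. a ResNet) as the encoder and a trained LSTM as the decoder.

```python
import numpy as np

# Minimal sketch of the encoder-decoder captioning loop: an image
# feature vector (here random, standing in for CNN output) initializes
# an RNN hidden state, which greedily emits words until "<end>".
rng = np.random.default_rng(0)
VOCAB = ["<start>", "<end>", "a", "dog", "runs", "park"]  # toy vocabulary
D_IMG, D_HID = 8, 16

image_feat = rng.normal(size=D_IMG)          # "encoder" output (placeholder)

W_ih = rng.normal(size=(D_HID, D_IMG))       # image -> initial hidden
W_hh = rng.normal(size=(D_HID, D_HID))       # hidden recurrence
W_eh = rng.normal(size=(D_HID, len(VOCAB)))  # word one-hot -> hidden
W_ho = rng.normal(size=(len(VOCAB), D_HID))  # hidden -> vocab logits

def one_hot(i):
    v = np.zeros(len(VOCAB))
    v[i] = 1.0
    return v

def decode(image_feat, max_len=10):
    h = np.tanh(W_ih @ image_feat)           # init hidden from image
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_eh @ one_hot(word))
        word = int(np.argmax(W_ho @ h))      # greedy word choice
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption

caption = decode(image_feat)
print(caption)
```

With untrained random weights the output is of course meaningless; the sketch only shows the control flow that the hierarchical-semantic method extends.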
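The scene-based factored attention described in innovation 2) can be sketched as scene-conditioned attention over region features: the scene vector and the decoder state jointly form the attention query, so scene context biases which regions are attended. The weight names, shapes, and the additive factoring are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

# Hedged sketch: a scene-semantic vector modulates attention over image
# region features before the attended context is fed to the decoder.
rng = np.random.default_rng(1)
D, N_REGIONS = 6, 4

regions = rng.normal(size=(N_REGIONS, D))  # region-level visual features
scene = rng.normal(size=D)                 # scene-semantic vector
hidden = rng.normal(size=D)                # decoder hidden state

W_s = rng.normal(size=(D, D))              # projects scene context
W_h = rng.normal(size=(D, D))              # projects language context

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scene_factored_attention(regions, scene, hidden):
    # Factor the query into a scene part and a language part, so the
    # scene context biases which regions the decoder attends to.
    query = W_s @ scene + W_h @ hidden
    scores = regions @ query
    alpha = softmax(scores)                # attention distribution
    context = alpha @ regions              # attended visual context
    return alpha, context

alpha, context = scene_factored_attention(regions, scene, hidden)
print(alpha)
```

The same attention call can be repeated per visual level (regions, object-centered concepts, scene semantics) and the resulting contexts combined, mirroring the three-level embedding the abstract describes.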
Keywords/Search Tags:Image Captioning, Hierarchical Visual Semantic Embedding, Scene-based Factored Attention, Scene understanding