
Research On Image Captioning Based On Attention

Posted on: 2022-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: J X Tian
Full Text: PDF
GTID: 2518306602993959
Subject: Master of Engineering
Abstract/Summary:
Image captioning can be summed up in a simple sentence: "when you look at a picture, you can describe it with a sentence." By quickly glancing at an image, humans can point out and describe a large number of details of the visual scene, but this everyday ability remains an elusive task for computers. Image captioning builds on both image understanding and text generation, making it a typical cross-modal task from vision to text: image features are extracted first, and the corresponding text description is generated from them. Two central problems in image captioning are deciding which image content to describe and controlling the order in which the description is generated. Prior studies have shown that attention mechanisms help with both. This thesis studies how to better combine attention mechanisms with image captioning models. The main contributions are:

1. Existing image captioning models attend only to the language strategy and cannot reason by capturing visual context information. We propose a novel correlation attention mechanism for image captioning. The model treats previous visual attention as context-aware visual features and decides whether those features should be used for generating the current word. Compared with traditional visual attention, correlation attention not only focuses on important visual regions at each time step but can also handle more complex visual features over time. In addition, it takes the decoder output as a semantic feature, and correlation features are obtained by computing the relationship between the context-aware visual features and the semantic vectors. Experiments show that correlation attention is more effective than the traditional visual attention method.

2. To further enrich the visual features, we use semantic geometry graphs to model the relations between objects. Because the original semantic geometry graph structure does not match the captioning task, and erroneous node connections can mislead it, we propose an image captioning model based on channel attention. Starting from the structure of the semantic geometry graph, the model addresses the complexity of its object, attribute, and relation nodes. To learn effective connections between the nodes of the original graph, we use channel attention and a multi-layer graph convolutional network to learn edge types and soft connections between nodes, producing more effective multi-hop connections. An attention mechanism then determines which type of semantic geometric unit (object node, attribute node, or relation node) is most relevant to the word currently being generated, and the selected unit drives word decoding. Experiments show a clear improvement over the model that uses the original semantic geometry graph.

3. In a multi-head attention mechanism each head can attend to different feature information, so we propose an image captioning model based on multi-head contrast attention. The key observation is that although each head attends to different features, applying a linear layer after all the attention heads mixes the differentiated features together again, which is not conducive to word decoding and generation. To solve this, multi-head contrast attention adds a word-decoding module after each head, so that each head generates a corresponding word (object word, attribute word, or relation word); a contrast method then selects the most suitable word at the current time step. Experiments show that our model achieves better performance on both natural images and remote sensing images.
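The correlation-attention idea in contribution 1 can be illustrated with a toy sketch. This is not the thesis's actual model: the dot-product scoring, the sigmoid gate, and all function names here are illustrative assumptions, shown only to make the "score past visual features against a semantic vector, then gate whether visual evidence is used" structure concrete.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def correlation_attention(context_visual, semantic, hidden):
    """Toy correlation attention.

    context_visual: list of past (context-aware) visual feature vectors.
    semantic:       semantic vector taken from the decoder output.
    hidden:         current decoder hidden state.
    """
    # Correlation scores between each past visual feature and the semantic vector.
    scores = softmax([dot(v, semantic) for v in context_visual])
    # Attended visual feature: weighted sum of the context-aware features.
    dim = len(context_visual[0])
    attended = [sum(w * v[i] for w, v in zip(scores, context_visual))
                for i in range(dim)]
    # A sigmoid gate decides how much visual evidence feeds the current word.
    gate = 1.0 / (1.0 + math.exp(-dot(hidden, semantic)))
    return [gate * a for a in attended], scores
```

The gate plays the "decide whether these visual features are used for the current word" role: near 0 the word is generated mostly from language context, near 1 mostly from visual evidence.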
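For contribution 2, the "soft connections learned over a semantic geometry graph, re-weighted per channel" step can be sketched as a single toy graph-convolution layer. Again this is an assumption-laden illustration, not the thesis's network: `soft_adj` stands in for the learned soft connections between nodes, and `channel_weights` stands in for channel attention.

```python
def soft_gcn_layer(node_feats, soft_adj, channel_weights):
    """One toy graph-convolution step over a semantic geometry graph.

    node_feats:      list of node feature vectors (object/attribute/relation nodes).
    soft_adj:        soft_adj[i][j] is the learned connection strength from node j
                     to node i (a soft replacement for the original hard edges).
    channel_weights: per-channel scaling, a stand-in for channel attention.
    """
    dim = len(node_feats[0])
    out = []
    for i in range(len(node_feats)):
        # Aggregate neighbour features weighted by the learned soft connections.
        agg = [0.0] * dim
        for j, feat in enumerate(node_feats):
            for c in range(dim):
                agg[c] += soft_adj[i][j] * feat[c]
        # Channel attention: re-weight each channel of the aggregated message.
        out.append([channel_weights[c] * agg[c] for c in range(dim)])
    return out
```

Stacking several such layers lets a node aggregate information along multi-hop soft connections, which is the effect the thesis attributes to its channel-attention GCN.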
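Finally, the contrast step of contribution 3 — each head decodes its own candidate word, and a contrast rule picks one — can be sketched as below. The margin-over-other-heads rule is a hypothetical stand-in for the thesis's contrast method, used only to show the selection structure.

```python
def contrast_select(head_proposals):
    """Toy contrast selection over per-head word proposals.

    head_proposals: list of (word, confidence) pairs, one per attention head
    (e.g. an object head, an attribute head, and a relation head). The toy
    rule picks the proposal whose confidence exceeds the mean confidence of
    the other heads by the largest margin.
    """
    best_word, best_margin = None, float("-inf")
    for i, (word, conf) in enumerate(head_proposals):
        others = [c for j, (_, c) in enumerate(head_proposals) if j != i]
        margin = conf - sum(others) / len(others)
        if margin > best_margin:
            best_word, best_margin = word, margin
    return best_word
```

Because each head keeps its own decoder, the differentiated features are never mixed by a shared linear layer; the contrast rule only compares the heads' finished proposals.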
Keywords/Search Tags:Deep Learning, Image Caption, Attention