
Research On Image Captioning Generation Based On Faster R-CNN And Visual Attention

Posted on: 2020-09-11    Degree: Master    Type: Thesis
Country: China    Candidate: X C Hou    Full Text: PDF
GTID: 2428330620967830    Subject: Signal and Information Processing
Abstract/Summary:
In the field of artificial intelligence, the goal of image captioning is to take an image as input and have a machine generate natural, fluent language that matches human expression. This is easy for people but extremely difficult for machines: the machine must not only accurately identify the objects contained in an image, but also capture the attributes of those objects and the action relationships between them. Image captioning has therefore long been a research hotspot at the intersection of computer vision and natural language processing.

Inspired by recent work in machine translation, visual attention mechanisms have been widely adopted for image captioning. However, for images with complicated backgrounds, most attention models generate captions of poor quality, or even captions unrelated to the image content. In addition, most methods force visual attention to be active for every generated word, even though the decoder needs little or no visual information to predict non-visual words: non-semantic words such as "the" and "of" have no corresponding region in the image, and the decoder only needs the language model to generate them. Finally, traditional image captioning training suffers from exposure bias, and most models are trained with the cross-entropy loss but evaluated at test time with natural language processing metrics, which creates a mismatch between the training objective and the evaluation measure.

The main research contents of this paper are as follows:

1. An image captioning model that combines bottom-up and top-down attention mechanisms is proposed. The bottom-up mechanism uses Faster R-CNN to propose a set of salient image regions, each represented by a pooled convolutional feature vector. An attention mechanism is then introduced into a Long Short-Term Memory (LSTM) network to determine feature weightings: at each time step, visual attention attends to the image features to generate the next word of the caption (a sketch of this attention step is given after the abstract). The model is verified on the MSCOCO dataset, and the experimental results show that it effectively improves the quality of the generated captions.

2. An adaptive attention image captioning model is proposed. A novel spatial attention model is first proposed for extracting spatial image features. A new Long Short-Term Memory (LSTM) extension is then introduced, which produces an additional visual sentinel; at each time step, the model can automatically decide when to rely on visual signals and when to rely only on the language model (see the sentinel sketch below). The model is validated on the Flickr30K and MSCOCO datasets, and the experimental results show that it achieves the highest scores on the four evaluation metrics BLEU, ROUGE, METEOR and CIDEr. Compared with the model that combines bottom-up and top-down attention, the scores on the four metrics improve by 3%~5%.

3. A self-critical sequence training (SCST) approach is introduced. It can be used to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. Rather than estimating a separate baseline to normalize the rewards, it uses the output of its own test-time inference algorithm to normalize the rewards it experiences (see the SCST loss sketch below). The experimental results show that directly optimizing the CIDEr metric with SCST and greedy decoding at test time is highly effective, improving the best CIDEr result on the MSCOCO dataset from 1.149 to 1.277.
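The bottom-up/top-down model in contribution 1 re-weights the Faster R-CNN region features with the decoder state at every time step. The following is a minimal PyTorch sketch of that attention step, an illustration of the mechanism rather than the thesis implementation; all dimensions, module names and the single-layer scoring MLP are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Illustrative attention over detector regions, conditioned on the LSTM state."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)    # project region features
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)            # scalar score per region

    def forward(self, regions, hidden):
        # regions: (batch, k, feat_dim) pooled Faster R-CNN features
        # hidden:  (batch, hidden_dim) decoder LSTM state at the current step
        e = self.score(torch.tanh(self.proj_v(regions) + self.proj_h(hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)             # attention weights over the k regions
        context = (alpha * regions).sum(dim=1)  # attended image feature for this word
        return context, alpha.squeeze(-1)
```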
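Contribution 2 adds a visual sentinel: a vector distilled from the decoder's own memory that the model can attend to instead of the image, so non-visual words can be generated from the language model alone. A minimal sketch of this idea follows; the gating form, dimensions and the way the mixing weight `beta` is supplied are assumptions made for illustration, not the thesis code.

```python
import torch
import torch.nn as nn

class VisualSentinel(nn.Module):
    """Illustrative sentinel vector computed from the LSTM input, state and memory cell."""
    def __init__(self, input_dim=512, hidden_dim=512):
        super().__init__()
        self.gate_x = nn.Linear(input_dim, hidden_dim)
        self.gate_h = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x_t, h_prev, c_t):
        # x_t: LSTM input, h_prev: previous hidden state, c_t: current memory cell
        g = torch.sigmoid(self.gate_x(x_t) + self.gate_h(h_prev))  # sentinel gate
        return g * torch.tanh(c_t)                                 # sentinel vector s_t

def mix_with_sentinel(context, sentinel, beta):
    # beta in [0, 1]: weight on the sentinel (here assumed to come from extending
    # the spatial attention to score the sentinel alongside the image regions);
    # beta close to 1 means "ignore the image and rely on the language model".
    return beta * sentinel + (1.0 - beta) * context
```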
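Contribution 3 trains directly on a sequence-level metric by using the model's own greedy decode as the reward baseline. The sketch below shows the resulting REINFORCE-style loss under the assumption that per-caption rewards (e.g. CIDEr scores of the sampled and greedy captions) have already been computed outside the function; it is a simplified illustration, not the thesis training code.

```python
import torch

def scst_loss(log_probs, sampled_reward, greedy_reward):
    # log_probs:      (batch,) summed log-probabilities of the sampled captions
    # sampled_reward: (batch,) metric score (e.g. CIDEr) of the sampled captions
    # greedy_reward:  (batch,) metric score of the greedy test-time decodes
    advantage = sampled_reward - greedy_reward         # self-critical baseline
    return -(advantage.detach() * log_probs).mean()    # policy-gradient loss
```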
Keywords/Search Tags: image captioning, attention mechanism, convolutional neural network, long short-term memory network, word vector