
Research On Image Description Generation Based On Visual Attention

Posted on: 2021-05-30    Degree: Master    Type: Thesis
Country: China    Candidate: K X Fan    Full Text: PDF
GTID: 2428330623468548    Subject: Engineering
Abstract/Summary:
With the rapid development of computer vision and natural language processing, cross-disciplinary tasks spanning these two fields have attracted increasing attention from researchers. This paper focuses on image captioning, a visual captioning task defined as follows: given an image, the model is required to automatically generate a natural language description of it. Image captioning can be applied in many real-world scenarios, for example helping the visually impaired, improving the accuracy of image retrieval, and assisting human-computer interaction.

Most traditional image captioning models adopt an encoder-decoder structure combined with an attention mechanism. This framework has achieved strong results, but several problems remain. First, in traditional models the text description produced by the decoder is taken directly as the final result, so these methods lack a deliberation process. Second, exposure bias exists in the encoder-decoder structure. Finally, traditional models focus only on the accuracy of the generated captions, so images with similar content may end up with identical descriptions.

To address these problems, this paper designs an image description generation system based on a deliberation attention mechanism. The proposed system consists of three parts. First, the model implements deliberation with two layers of residual attention: the first-pass residual-based attention layer prepares the hidden states and visual attention for generating a preliminary version of the caption, while the second-pass deliberate residual-based attention layer refines them. By introducing this deliberation process, the model generates more accurate descriptions. Second, this paper combines a cross-modal retrieval method with reinforcement learning to address the low discriminability of traditional image captioning models; the reinforcement learning module also alleviates the mismatch between the data flow during training and testing, i.e., the exposure bias problem. Finally, the experimental results of the proposed model on the MS-COCO and Flickr30K datasets exceed recently published results. Specifically, the model improves the state of the art on MS-COCO, reaching 37.5% BLEU-4, 28.5% METEOR, and 125.6% CIDEr, and it reaches 29.4% BLEU-4 and 66.6% CIDEr on Flickr30K.
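The abstract does not give equations or implementation details, but the two-pass deliberation idea can be illustrated with a minimal PyTorch-style sketch. All names below (AdditiveAttention, DeliberationDecoderStep, the additive attention form, the dimensions, and the placement of the residual connections) are assumptions introduced for illustration, not the author's exact formulation; the thesis model may differ in its attention function, recurrent cells, and how the two passes are combined.

```python
# Minimal sketch of one two-pass (deliberation) decoding step with residual attention.
# Module names, dimensions, and residual placements are illustrative assumptions only.
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention over image region features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, R, feat_dim) region features; hidden: (B, hidden_dim) query
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # (B, R, 1) attention weights
        context = (alpha * feats).sum(dim=1)     # (B, feat_dim) attended visual context
        return context, alpha.squeeze(-1)


class DeliberationDecoderStep(nn.Module):
    """One decoding step: a first-pass layer drafts hidden states and visual attention,
    a second-pass layer re-attends and refines them via residual connections."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.first_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.first_attn = AdditiveAttention(feat_dim, hidden_dim, attn_dim)
        self.second_cell = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.second_attn = AdditiveAttention(feat_dim, hidden_dim, attn_dim)
        self.out = nn.Linear(hidden_dim + feat_dim, vocab_size)

    def forward(self, feats, prev_word, state1, state2):
        # First pass: draft hidden state and visual context for a preliminary caption.
        mean_feat = feats.mean(dim=1)
        h1, c1 = self.first_cell(torch.cat([self.embed(prev_word), mean_feat], dim=-1), state1)
        ctx1, _ = self.first_attn(feats, h1)
        # Second pass: refine the draft; residual connections keep the first-pass signal.
        h2, c2 = self.second_cell(torch.cat([h1, ctx1], dim=-1), state2)
        ctx2, _ = self.second_attn(feats, h2 + h1)              # residual on the attention query
        logits = self.out(torch.cat([h2 + h1, ctx2], dim=-1))   # residual on the output path
        return logits, (h1, c1), (h2, c2)


# Example of a single step with random region features (shapes are assumptions).
B, R, H = 2, 36, 512
decoder = DeliberationDecoderStep(feat_dim=2048, embed_dim=512, hidden_dim=H, vocab_size=10000)
feats = torch.randn(B, R, 2048)
prev_word = torch.zeros(B, dtype=torch.long)   # e.g. a <bos> token id
state1 = (torch.zeros(B, H), torch.zeros(B, H))
state2 = (torch.zeros(B, H), torch.zeros(B, H))
logits, state1, state2 = decoder(feats, prev_word, state1, state2)
```

In a full model this step would be unrolled over the caption length, and the reinforcement learning stage described in the abstract would replace cross-entropy training with a sequence-level reward (for example CIDEr) computed on captions sampled from the decoder, which is how exposure bias is typically mitigated; the exact reward and cross-modal retrieval objective used in the thesis are not specified on this page.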
Keywords/Search Tags: computer vision, natural language processing, image captioning, attention mechanism, deliberation