
Image Captioning Based On Improved Attention

Posted on: 2021-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Li
Full Text: PDF
GTID: 2518306503472094
Subject: Computer technology
Abstract/Summary:
Image captioning draws on knowledge from both Computer Vision and Natural Language Processing. It is a challenging task: the model must not only recognize and detect the objects in an image, but also discover their relationships and finally generate a natural, fluent description. The task matters in both research and application. From the research perspective, studying image captioning takes us closer to the ultimate target of true image understanding; from the application perspective, image captioning can help visually impaired people perceive their environment and improve the efficiency of image-text retrieval and image labeling.

The prevailing approach to image captioning applies attention within the encoder-decoder framework. Researchers have attempted to improve it from many angles, for instance by using multiple stages to refine the description, modifying the decoder structure, or using better encoders to extract semantic features or attributes. Attention has become a widely used mechanism in artificial intelligence. In essence, it is simply a weighted sum over a set of features, where each weight expresses the degree of focus; despite its simplicity, it can bring large gains. This thesis focuses on attention in image captioning.

We propose two ways to improve attention. First, we argue that the information in the previous attention result can guide the attention at the next step, so we propose recurrent attention, including naive recurrent attention, gated recurrent attention, and LSTM-based recurrent attention. Second, we observe that common attention does not explicitly consider the relations among features, so we propose self-attention to model these relations and obtain a representation with global information, and we study how to combine self-attention with common attention.

Experiments on the MSCOCO dataset suggest that naive recurrent attention cannot efficiently exploit the information from the previous time step; gated recurrent attention slightly improves the evaluation-metric scores, but the improvement is limited; and LSTM-based recurrent attention markedly improves the performance of the captioning model, with the gain not coming merely from its three-layer structure. Self-attention by itself improves the model when used as an attention mechanism under the encoder-decoder framework; the sequential fusion method makes the model harder to learn and train, while the parallel fusion method achieves the best results, competitive with state-of-the-art models.
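The mechanisms discussed above can be illustrated in a few lines of NumPy. This is a minimal sketch, not the thesis's implementation: the additive (tanh-based) scoring function, the scaled dot-product form of self-attention, all dimensions, and the way the gate mixes the two attention results are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def common_attention(features, query, W_f, W_q, v):
    # Common attention: score each image-region feature against the decoder
    # query, normalize with softmax, and return the weighted sum (the weight
    # on each region expresses the degree of focus).
    # features: (n, d) region features; query: (h,) decoder hidden state.
    scores = np.tanh(features @ W_f + query @ W_q) @ v   # (n,)
    weights = softmax(scores)
    return weights @ features, weights                   # context: (d,)

def self_attention(features, W_qk):
    # Self-attention over the feature set itself: every region attends to
    # every other region, yielding a representation with global information
    # that common attention does not model explicitly.
    q = k = features @ W_qk
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # (n, n)
    return attn @ features                               # (n, d)

def gated_recurrent_attention(ctx_t, ctx_prev, W_g, state):
    # Sketch of the gated recurrent-attention idea: a learned sigmoid gate
    # mixes the current attention result with the previous step's, letting
    # past attention guide the present (gate form is an assumption).
    g = 1.0 / (1.0 + np.exp(-(state @ W_g)))             # scalar in (0, 1)
    return g * ctx_t + (1.0 - g) * ctx_prev
```

In the LSTM-based variant described in the abstract, the mixing above would instead be performed by an LSTM cell carrying the attention history across decoding steps.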
Keywords/Search Tags:Image Captioning, Attention, Recurrent Attention, Self-attention