The Research Of Image Captioning Based On Multi-Attention Model And Copy Mechanism

Posted on: 2020-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
GTID: 2428330590978395
Subject: Computer technology
Abstract/Summary:
Cross-disciplinary learning has grown steadily, and more research is becoming practically oriented. One topic that has drawn particular attention from scholars is image captioning, which has great potential in fields such as assisted medical treatment and assisted education. In recent years a great deal of research has been done on the image captioning task, with deep-learning-based models as the focus. These models borrow ideas from machine translation and recast the original framework as an encoder-decoder framework, which not only replaces the traditional approach based on object detection but also turns image captioning into an end-to-end “translation” task that is easy to understand. Later work introduced another essential model from machine translation, the attention model, into the image captioning task. In this model, different regions of the same set of feature maps are scored by importance at each time step in order to locate the next attention position. However, in real applications the objects and scenes that systems and users encounter are highly variable and unpredictable. Current datasets, although they cover multiple object categories and application scenarios, cannot change the fact that description performance is limited by the datasets and the language model. Meanwhile, expanding the datasets to solve this problem would be very difficult and would incur excessive research cost. Therefore, to improve generalization ability and robustness, this thesis proposes two improvements from the perspectives of multi-feature fusion and multi-model combination.

(1) An image captioning model based on a multi-attention mechanism is proposed to correct the image semantic deficiency and inaccurate attention localization that result from using only the last convolutional layer's features to encode the context vector in the encoding phase. Given that the attention mechanism attends differently to different regions of the same feature map at different times, this model introduces a spatial attention model and a semantic attention model into the image captioning task and tries to improve the accuracy of object localization from two aspects, the hidden layer and the feature channel. In addition, the model borrows the idea of using multi-layer features to raise performance in object detection: it adds multi-layer feature fusion and uses the newly generated feature map to encode the context vectors. The experimental results show that the improved network model not only corrects the inaccurate localization of image attention but also improves description accuracy.

(2) Traditional image captioning models made a major breakthrough with the arrival of the encoder-decoder framework and the attention mechanism. However, the problem that descriptive performance depends too heavily on training datasets and language models remains unsolved. To address it, this thesis proposes an image captioning model based on regularization and a copy mechanism. The idea comes from daily conversation, in which people often copy a word or short phrase from each other. Accordingly, the model adds an auxiliary network, called the copy mechanism, that copies image content into the generated captions. In addition, to preserve the integrity of image information during decoding, a regularization mechanism is introduced, which encourages the current hidden state to retain more complete image information by reconstructing the previous hidden state, and which also acts as a regularizer on the LSTM network. The model was then validated through experiments on the Flickr30K and MSCOCO datasets and was shown to effectively address the problem of weak generalization in description generation.
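
The attention step described in the abstract, where regions of a feature map are scored by importance at each time step and combined into a context vector, can be illustrated with a minimal sketch. This is not the thesis's implementation: the dot-product scoring function, the function names `softmax` and `spatial_attention`, and the toy vectors are all assumptions made for illustration, since the abstract does not specify how scores are computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(regions, hidden):
    """Weight each image region by its relevance to the decoder state.

    regions: list of region feature vectors, one per feature-map location
    hidden:  decoder hidden state, same dimension as a region vector
    Returns (weights, context).
    """
    # Dot-product scoring is an assumption; the abstract does not
    # state which scoring function the thesis uses.
    scores = [sum(r_d * h_d for r_d, h_d in zip(r, hidden)) for r in regions]
    weights = softmax(scores)
    dim = len(hidden)
    # Context vector: attention-weighted sum of the region features.
    context = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return weights, context


# Toy example: three regions with 2-dimensional features (hypothetical values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = spatial_attention(regions, hidden)
```

In this toy example the first and third regions align with the hidden state and receive equal, larger weights than the second region. A decoder would recompute the weights at every time step as its hidden state changes, which is how the attended position moves across the image.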
Keywords/Search Tags:Image Caption, Attention Mechanism, Encoder-Decoder Frame, Copy Mechanism, Regularization Mechanism