
Research On Image Caption Generation Based On Deep Learning

Posted on: 2020-11-05
Degree: Master
Type: Thesis
Country: China
Candidate: H S Zhang
GTID: 2428330578957410
Subject: Computer technology
Abstract/Summary:
Image captioning is an application that combines image recognition with natural language processing; it is of great value for children's early education and for life support for visually impaired people. The goal is to generate coherent, fluent descriptive sentences for image content. Most existing image captioning models are built on the deep-learning encoder-decoder framework: a Convolutional Neural Network (CNN) serves as the encoder to extract image features, and a Recurrent Neural Network (RNN) serves as the decoder to generate descriptive sentences. However, the encoder-decoder framework has the following problems: the feature map extracted by the CNN during encoding is not sufficient to summarize the large amount of information contained in the image, and the proportion of image information shrinks as the RNN steps forward during decoding.

To solve these problems, this thesis adds an image semantic network to the encoder-decoder framework. It not only supplements the image information but also compensates for the image information that gradually diffuses during decoding. A semantic attention mechanism is designed so that, when predicting a word, the model can focus its attention on the corresponding word vector in the image semantic layer. The specific work is as follows:

(1) The ESRNN model first generates multiple tags for each image from the image's descriptions, and uses a CNN as the image semantic network to perform multi-label classification, obtaining the probability that the image contains each type of label; the word vectors of the labels with the highest probabilities are combined into the image semantic layer. Image features are then extracted by another CNN acting as the encoder, and finally an RNN acting as the decoder generates a natural-language description from the encoder output and the image semantic layer. Experiments show that, with the image semantic network added, the ESRNN model generates better descriptions than the NIC model, raising the BLEU score by 0.021 and 0.018 on the Flickr8k and Microsoft COCO Caption datasets, respectively.

(2) The AESRNN model builds on the ESRNN model. The semantic attention mechanism designed in this thesis and an image attention mechanism are added to the decoder to generate a context vector. The model also designs gating scalars to control the proportions of the encoder output and the context vector when the RNN uses them to generate descriptions. As a result, the AESRNN model can focus on the corresponding positions of the feature map when predicting words that carry image position information, and can focus on the corresponding words in the image semantic layer when predicting words that do not. Experiments show that, compared with the ESRNN model, the AESRNN model improves scores by an average of 0.031 and 0.029 on the Flickr8k and Microsoft COCO Caption datasets.
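As a rough illustration of step (1), the construction of the image semantic layer can be sketched as follows. This is a minimal sketch under assumptions: the function name, array shapes, and top-k selection are illustrative, and the thesis does not specify an implementation.

```python
import numpy as np

# Hypothetical sketch of building the ESRNN "image semantic layer":
# a multi-label CNN is assumed to have already produced one probability
# per label, and each label is assumed to have a pre-trained word embedding.
def build_semantic_layer(label_probs, word_vectors, k=3):
    """Stack the word vectors of the k most probable labels.

    label_probs  : (num_labels,) label probabilities from the semantic CNN
    word_vectors : (num_labels, embed_dim) one embedding per label
    returns      : (k, embed_dim) image semantic layer
    """
    top_k = np.argsort(label_probs)[::-1][:k]  # indices of the k highest probabilities
    return word_vectors[top_k]

# Toy example: 5 labels with 4-dimensional embeddings
probs = np.array([0.1, 0.8, 0.3, 0.9, 0.2])
vecs = np.arange(20, dtype=float).reshape(5, 4)
layer = build_semantic_layer(probs, vecs, k=2)  # rows for labels 3 and 1
```

The decoder would then receive this (k, embed_dim) matrix alongside the encoder's feature map.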
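Step (2)'s combination of image attention, semantic attention, and a gating scalar can be sketched in a simplified form. Note the assumptions: in the thesis the gating scalars are learned by the model, whereas here `beta` is a fixed input, and dot-product scoring stands in for whatever scoring function the thesis uses.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_context(hidden, feature_map, semantic_layer, beta):
    """Blend an image-attention context and a semantic-attention context.

    hidden         : (d,) current decoder hidden state
    feature_map    : (n_regions, d) encoder feature map, one row per region
    semantic_layer : (k, d) word vectors of the image semantic layer
    beta           : gating scalar in [0, 1]; 1 attends only to the image,
                     0 attends only to the semantic layer
    """
    img_weights = softmax(feature_map @ hidden)     # attention over image regions
    sem_weights = softmax(semantic_layer @ hidden)  # attention over semantic words
    img_ctx = img_weights @ feature_map
    sem_ctx = sem_weights @ semantic_layer
    return beta * img_ctx + (1.0 - beta) * sem_ctx

# Toy example: 3 image regions and 2 semantic word vectors in 3-D
hidden = np.ones(3)
feature_map = np.eye(3)
semantic_layer = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
ctx = gated_context(hidden, feature_map, semantic_layer, 0.5)
```

With `beta` near 1 the context vector follows the feature map (position-bearing words); with `beta` near 0 it follows the semantic layer (words without image position information), matching the behavior the abstract describes.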
Keywords/Search Tags: Image Processing, Natural Language Processing, Image Semantics Information, Attention Mechanism, Image Captioning