
Research On Image Caption Generation Method Based On Deep Learning

Posted on: 2019-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: J F Wang
Full Text: PDF
GTID: 2438330548958407
Subject: Communication and Information System
Abstract/Summary:
In recent years, with the advent of large-scale data sets, deep learning has achieved great success in many traditional computer vision tasks owing to its powerful computing capabilities, especially in image recognition. However, most existing work assigns an image one or more discrete labels; it describes neither the relationships between the objects in the image nor what is happening in the image. To address this problem, this paper uses recent deep learning techniques to design a model that connects images with natural language, thereby achieving image caption generation.

The model designed in this paper consists of two parts: an image feature extraction part and a language modeling and generation part. The former uses a pre-trained convolutional neural network as a feature extractor, and the latter uses a recurrent LSTM network. On this basis, this paper designs two ways to connect the two parts into a single neural network that can be trained end to end.

(1) Connect the two parts through a fully connected layer, that is, feed the fully connected layer features of the convolutional neural network into the LSTM network. This method is simple to implement and computationally cheap, and it achieves basic image caption generation. Its disadvantage is that the global image feature is used only at initialization, and the spatial information between image contents is ignored.

(2) Adopt a new approach based on the attention mechanism, which is more complex and computationally intensive but can make full use of the image features at every location to produce better results. This method first extracts two-dimensional image features from a convolutional layer of the convolutional neural network, then maps the image feature vectors and the word vectors into the same dimensional space through two fully connected transforms, and computes the similarity between the two as the model's attention weights. Finally, the attended image features and the word vectors are used together to generate the word at the next time step.

(3) The two models are trained on the Flickr8K data set and its Chinese version Flickr8K-CN, realizing caption generation in both English and Chinese. Experiments show that the model adapts well to different languages, and the model with the attention mechanism outperforms the basic model without attention on all evaluation metrics.
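The following is a minimal sketch of the basic connection described in (1), assuming PyTorch and torchvision are available; the class name, layer sizes, and the choice of ResNet-50 as the pre-trained backbone are illustrative assumptions, not details taken from the thesis.

# Minimal sketch of the basic encoder-decoder captioner in (1):
# a frozen pre-trained CNN supplies one global feature that is fed to
# the LSTM only once, before the caption words.
import torch
import torch.nn as nn
import torchvision.models as models

class BasicCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Pre-trained CNN used purely as a feature extractor (weights frozen).
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Project the global image feature into the LSTM input space.
        self.img_fc = nn.Linear(2048, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The global feature acts as the first "token" of the sequence,
        # so the image information is seen only at initialization.
        feats = self.encoder(images).flatten(1)          # (B, 2048)
        img_token = self.img_fc(feats).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions)                     # (B, T, E)
        seq = torch.cat([img_token, words], dim=1)       # image first, then words
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                          # logits over the vocabulary

Because the image enters the decoder only through this initial token, later time steps have no direct access to region-level features, which is exactly the limitation that motivates the attention-based variant.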
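For the attention mechanism in (2), a possible reading of "two fully connected transforms followed by a similarity" is additive attention over convolutional regions, sketched below; the module name and the tensor shapes (region_feats of shape (B, L, D_img) from the conv layer, word_vec of shape (B, D_word)) are assumptions for illustration.

# Illustrative attention step: project region features and the word vector
# into a common space, score their similarity, and return a weighted
# combination of the regions as the attended image feature.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, d_img, d_word, d_common):
        super().__init__()
        # Two fully connected transforms map both inputs into the same space.
        self.proj_img = nn.Linear(d_img, d_common)
        self.proj_word = nn.Linear(d_word, d_common)
        self.score = nn.Linear(d_common, 1)

    def forward(self, region_feats, word_vec):
        img_h = self.proj_img(region_feats)                           # (B, L, C)
        word_h = self.proj_word(word_vec).unsqueeze(1)                # (B, 1, C)
        # Similarity between each region and the current word is the attention
        # score; softmax turns the scores into weights over the L regions.
        scores = self.score(torch.tanh(img_h + word_h)).squeeze(-1)   # (B, L)
        weights = torch.softmax(scores, dim=-1)
        context = (weights.unsqueeze(-1) * region_feats).sum(dim=1)   # (B, D_img)
        return context, weights

The attended context vector would then be combined with the current word vector and passed to the LSTM to predict the word at the next time step.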
Keywords/Search Tags:deep learning, computer vision, image caption generation, attention model