
Research On Image Caption Based On Deep Learning

Posted on: 2020-09-24    Degree: Master    Type: Thesis
Country: China    Candidate: X Liu    Full Text: PDF
GTID: 2518306464495504    Subject: Master of Engineering
Abstract/Summary:
With the explosive development of technologies such as the Internet, new media, and mobile devices, the network is filled with images that are difficult to use due to a lack of annotation. As a key technology for computer interpretation and description of image content, image captioning can effectively address the problem of incomplete image annotation and is widely used in image retrieval, image annotation, image analysis, and other fields. Deep learning has a wide range of applications in image captioning. However, previous research neglected the important influence of scene factors, in both images and corpora, on description sentences, so existing neural network models produce unsatisfactory image descriptions. Meanwhile, because previous text generation models are shallow, it is difficult for them to make full use of image features when generating descriptions, which limits the accuracy and richness of the sentences to some extent. To address these problems, this thesis studies and improves image feature extraction, corpus information extraction, and the description generation model. The specific work is as follows:

(1) In the image feature extraction part, to better capture the scenes, people, objects, and their relationships in an image, and to compensate for the lack of scene and object information in the dataset, this thesis proposes an image scene extraction method based on transfer learning, which is used to construct a model with stronger scene feature extraction ability. The model first trains ResNet and Faster R-CNN on the large-scale scene dataset Places365 and the large-scale object dataset ImageNet, respectively, and then transfers the parameters to the model to extract scene and object features from images. The two kinds of features optimize and complement each other: the object features supplement the scene information, and the scene features clarify the object information, making the textual description of the image more accurate and rich.

(2) In the corpus scene information extraction part, to better describe scene information and accurately use the vocabulary corresponding to each scene, this thesis proposes a corpus scene information extraction algorithm. It uses LDA to analyze the text in the image corpus and recognizes scenes through the vocabulary in that text. The relationship between scenes and vocabulary is thus obtained, so that when generating a new image description the model uses vocabulary associated with the image scene with high probability, which narrows the range of vocabulary selection during generation.

(3) In the description generation part, to make full use of the previously obtained image features, corpus scene information, and object information, this thesis introduces a two-layer LSTM and enhances the model's comprehension of image scene information and corpus scene information by using the context information of the two layers of LSTM units, so that the bottom LSTM can better transfer specific information (such as scenes, objects, and their attributes) to the top LSTM. As a result, a large number of scene-related words are used accurately in the textual description, which addresses the limited descriptive ability and low sentence-generation accuracy of a single-layer LSTM.

Finally, the methods proposed in this thesis are experimentally verified on the Flickr8K, Flickr30K, and MSCOCO datasets and analyzed with four metrics: BLEU, METEOR, ROUGE-L, and CIDEr-D. The experimental results show that the accuracy of the generated descriptions improves significantly after incorporating image scene information, object information, and corpus scene information. On the MSCOCO dataset in particular, the BLEU-1 score increases by 11.4% over the original model and by 17.1% over Deep-Vis. These results show that the proposed model is effective: its performance is greatly improved compared with the original model, and it also has advantages over other mainstream methods.
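Since the evaluation above reports BLEU-1 gains, a minimal sentence-level sketch of that metric may help clarify what is being measured: BLEU-1 is clipped (modified) unigram precision multiplied by a brevity penalty. The function below is an illustrative simplification written for this summary, not the corpus-level implementation used in the experiments.

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """Sentence-level BLEU-1 sketch: clipped unigram precision * brevity penalty.

    candidate: list of tokens; references: list of token lists.
    """
    cand_counts = Counter(candidate)
    # Clip each unigram count by its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * precision
```

The clipping step is what penalizes degenerate captions such as "the the the", which would otherwise achieve perfect unigram precision against any reference containing "the".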
Keywords/Search Tags: image caption, convolutional neural network, long short-term memory network, scene recognition, semantic information