
Research On Image Text Caption Algorithm Based On Deep Learning

Posted on: 2022-10-01    Degree: Master    Type: Thesis
Country: China    Candidate: H B Yu    Full Text: PDF
GTID: 2518306734957709    Subject: Master of Engineering
Abstract/Summary:
Image text captioning is an interdisciplinary research field whose goal is to express the elements in a picture, and the relationships between them, in fluent natural language. By imitating how the human brain processes incoming information, researchers design models that let a machine transform pictures into descriptive sentences, so that the machine can explore and perceive the world. Image text captioning has broad application prospects in blind guidance, image search, and automatic semantic annotation of medical images.

In an image text caption algorithm, a convolutional neural network (CNN) processes the dataset images and extracts image features, and a recurrent neural network then decodes these features into a natural caption sentence. There are two early approaches to image text captioning: 1. the template-based method, which detects the elements and the relationships between them and then fills words into a fixed template, but the resulting captions are too rigid; 2. the retrieval-based method, which first searches for images similar to the current image to serve as templates and must adjust them before the image relationships can be retrieved, which complicates the model. In view of the shortcomings of these two methods, this paper adopts the deep-learning-based encoder-decoder structure as the overall framework of the model, which extracts image features more accurately and generates more reliable caption sentences. The main work of this paper is as follows:

1. An image text caption algorithm based on Inception-v3 and Word2Vec is proposed. The model uses the basic encoder-decoder structure as its framework. In the first part, Inception-v3 efficiently extracts features from the images in the dataset. In the second part, to avoid the loss of information caused by decaying weights during long-range propagation in a plain recurrent neural network, a long short-term memory network (LSTM) is used in place of the plain recurrent neural network. When encoding text features, Word2Vec is used instead of one-hot encoding to capture the relationships between annotated words. Experimental results show that the improved model outperforms the original model on the MSCOCO dataset.

2. To further improve the performance of the image text caption model, the convolutional block attention module (CBAM) is introduced into the convolutional neural network. This model again uses the basic encoder-decoder structure as its framework. In the first part, the stronger convolutional neural network Inception-v4 extracts the key features of the images in the dataset, and a convolutional block attention module is added after each Inception module so that the output features of every Inception module are refined by spatial attention and channel attention, letting the model focus on the more important information. The second part again adopts the long short-term memory network. Because the training loss curve alone cannot directly show caption quality when comparing with other models, a series of objective evaluation criteria (CIDEr, BLEU-1, METEOR, BLEU-4, etc.) are used in addition to the loss curve to better compare the quality of the natural caption sentences generated by the model.
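To make the encoder-decoder pipeline of the first model concrete, below is a minimal PyTorch sketch of an Inception-v3 encoder feeding an LSTM decoder whose word embeddings can be initialized from Word2Vec-style vectors. The class names, dimensions, and the use of torchvision's pretrained Inception-v3 are illustrative assumptions, not the thesis implementation.

# Minimal encoder-decoder captioning sketch (assumed PyTorch/torchvision API);
# names and dimensions are illustrative, not taken from the thesis.
import torch
import torch.nn as nn
import torchvision.models as models

class InceptionEncoder(nn.Module):
    """Extracts a fixed-length image feature vector with Inception-v3."""
    def __init__(self, embed_dim=256):
        super().__init__()
        cnn = models.inception_v3(weights="IMAGENET1K_V1")
        cnn.fc = nn.Identity()            # keep the 2048-d pooled features
        self.cnn = cnn
        self.project = nn.Linear(2048, embed_dim)

    def forward(self, images):            # images: (B, 3, 299, 299)
        self.cnn.eval()                   # inference mode: no auxiliary branch
        with torch.no_grad():
            feats = self.cnn(images)      # (B, 2048)
        return self.project(feats)        # (B, embed_dim)

class LSTMDecoder(nn.Module):
    """Generates a caption word by word, conditioned on the image feature."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 word2vec_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        if word2vec_weights is not None:
            # Word2Vec-style vectors replace one-hot word encoding.
            self.embed.weight.data.copy_(word2vec_weights)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        word_embeds = self.embed(captions)                    # (B, T, E)
        # The image feature acts as the first input step of the sequence.
        inputs = torch.cat([img_feats.unsqueeze(1), word_embeds], dim=1)
        hidden, _ = self.lstm(inputs)                         # (B, T+1, H)
        return self.fc(hidden)                                # word logits

In practice such a decoder is trained with teacher forcing on (image, caption) pairs from MSCOCO and decoded greedily or with beam search at test time.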
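The convolutional block attention module used in the second model can likewise be sketched compactly. The block below follows the published CBAM design (channel attention followed by spatial attention); wiring it after every Inception module of Inception-v4, as described above, is omitted here, and the reduction ratio and kernel size are assumed defaults rather than the thesis settings.

# CBAM sketch: channel attention, then spatial attention, over a feature map.
# Hyper-parameters (reduction=16, kernel_size=7) are assumptions, not thesis values.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))         # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))          # global max pooling
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale                           # reweight the channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)           # (B, 1, H, W)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                           # reweight spatial positions

class CBAM(nn.Module):
    """Refines a feature map with channel attention, then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.spatial(self.channel(x))

Inserted after an Inception block, such a module rescales the block's output so that informative channels and spatial positions are emphasised before the features reach the LSTM decoder.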
Keywords/Search Tags:image text caption, long short-term memory network, word vector model, convolution block attention module