
Image Caption Method Based On Deep Learning

Posted on: 2020-07-25    Degree: Master    Type: Thesis
Country: China    Candidate: Z Chang    Full Text: PDF
GTID: 2428330599951289    Subject: Computer application technology
Abstract/Summary:
Since 2012, deep learning has been widely explored in computer vision and natural language processing, where deep neural networks achieve outstanding performance and demonstrate a powerful capacity for representation learning. As an interdisciplinary, cross-modal problem, the image caption task is an important step toward extending the representation learning ability of deep neural networks to multiple data domains. Its goal is to let a computer automatically generate a descriptive sentence for an image. Such a single-sentence description, however, tends to capture the salient, global content of the image rather than fine-grained entities; when an image contains many objects, one sentence can hardly convey all of its details. Because image captioning cannot cover this rich underlying semantics, the dense caption task emerged in 2016: it requires not only localizing meaningful regions in an image, but also describing each detected region in natural language. This paper studies the problems existing in both the image caption task and the dense caption task.

For the image caption task, neural-network-based methods generally suffer from two problems. First, the image feature vectors extracted by a convolutional neural network retain only a few salient features of the original image, discarding much useful information, so the generated sentences are often accompanied by mispredicted visual attributes. Second, RNNs suffer from vanishing gradients: as the number of time steps grows, gradient errors gradually vanish during backpropagation, so words generated at later time steps lack guidance from earlier information. To address these two problems, we introduce a multimodal fusion method for generating image descriptions in which, at every word-generation step, high-level semantic information and sentence features jointly guide the next word. The model uses an object detection method to produce attribute information for the image, and a temporal convolution structure to extract sentence features, which are fused into every RNN time step to strengthen the RNN's long-range dependency modeling of historical words. To measure the contribution of the two kinds of multimodal information, many different structures are designed and validated on the Flickr8K, Flickr30K, and MSCOCO datasets. The experimental results show that adding the two kinds of multimodal information to the baseline models (GRU, LSTM, Peephole LSTM) significantly improves performance; on MSCOCO in particular, BLEU@4 and CIDEr increase by 4.1% and 10.4%, respectively.
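To make the fusion idea concrete, the following is a minimal PyTorch-style sketch of the decoding step described above: an attribute vector from an object detector and a sentence-history feature, produced by a temporal convolution over the embeddings of the words generated so far, are concatenated with the current word embedding before being fed to the LSTM. All class and variable names (e.g. MultimodalFusionDecoder, attr_vec, sent_conv) are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusionDecoder(nn.Module):
    """Sketch: at every time step, fuse (1) a high-level attribute vector from
    an object detector and (2) a sentence-history feature produced by a
    temporal (1-D) convolution over previously generated word embeddings."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, attr_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # temporal convolution over the word-embedding sequence (sentence feature)
        self.sent_conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        # LSTM input = word embedding + attribute vector + sentence feature
        self.lstm = nn.LSTMCell(embed_dim + attr_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, attr_vec, captions):
        # attr_vec:  (B, attr_dim) attribute/semantic vector from the detector
        # captions:  (B, T) ground-truth word indices (teacher forcing)
        B, T = captions.shape
        emb = self.embed(captions)                        # (B, T, E)
        h = emb.new_zeros(B, self.lstm.hidden_size)
        c = emb.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # sentence feature: temporal conv over the words seen up to step t
            hist = emb[:, : t + 1].transpose(1, 2)        # (B, E, t+1)
            sent_feat = self.sent_conv(hist).mean(dim=2)  # (B, E)
            step_in = torch.cat([emb[:, t], attr_vec, sent_feat], dim=1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                 # (B, T, vocab)
```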
For the dense caption task, the backbone of existing models is usually shared by two modules: regional image feature extraction and regional object detection. In practice, we found that this shared backbone makes training inefficient, because the backbone parameters can hardly adapt to two training objectives at the same time. To solve this problem, the paper proposes a Bifurcate Inception structure, which gives the region feature extraction module and the region detection module separate branches and, combined with an alternating fixed training method, allows the two modules to be trained end to end without interfering with each other. Building on the Bifurcate Inception structure, the model includes two further improvements. First, to strengthen regional object detection, a one-stage detector is redesigned to replace the weaker RPN (Region Proposal Network). Second, when training the region description part, the model extracts not only regional convolutional features but also global image attribute information, so that the LSTM can exploit both kinds of visual information during training and produce more accurate region captions. Following these ideas, extensive experiments are conducted on the public Visual Genome dataset. The model reaches 8.21 mAP on Visual Genome v1.0 and 8.39 mAP on Visual Genome v1.2, nearly 53% higher than the previous mainstream FCLN (Fully Convolutional Localization Networks) dense caption model.
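The alternating fixed training could look something like the minimal sketch below, which assumes two separately parameterized branches (here called detector_branch and caption_branch) and simply freezes one while updating the other on alternating epochs. The names, data layout, and loss functions are hypothetical placeholders, not the thesis's code.

```python
import torch

def alternating_fixed_training(detector_branch, caption_branch, loader,
                               detector_loss_fn, caption_loss_fn,
                               epochs=10, lr=1e-4):
    """Sketch of alternating fixed training: even epochs update the detection
    branch with the caption branch frozen; odd epochs swap the roles, so the
    two backbones are optimized without interfering with each other."""
    opt_det = torch.optim.Adam(detector_branch.parameters(), lr=lr)
    opt_cap = torch.optim.Adam(caption_branch.parameters(), lr=lr)

    for epoch in range(epochs):
        train_detector = (epoch % 2 == 0)
        # freeze one branch, train the other
        for p in detector_branch.parameters():
            p.requires_grad = train_detector
        for p in caption_branch.parameters():
            p.requires_grad = not train_detector

        for images, region_targets, region_captions in loader:
            if train_detector:
                loss = detector_loss_fn(detector_branch(images), region_targets)
                opt_det.zero_grad()
                loss.backward()
                opt_det.step()
            else:
                # caption branch consumes regions proposed by the frozen detector
                with torch.no_grad():
                    regions = detector_branch(images)
                loss = caption_loss_fn(caption_branch(images, regions),
                                       region_captions)
                opt_cap.zero_grad()
                loss.backward()
                opt_cap.step()
```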
Keywords/Search Tags:image caption, dense caption, object detection, recurrent neural network, deep learning, LSTM