
Deep Learning-Based Image Caption

Posted on: 2019-05-22    Degree: Master    Type: Thesis
Country: China    Candidate: J Z Lou    Full Text: PDF
GTID: 2428330572456404    Subject: Circuits and Systems
Abstract/Summary:
Image captioning is an emerging task that combines computer vision and natural language processing and has important practical applications, such as image retrieval, assistance for visually impaired people, and human-computer interaction. Unlike coarse image understanding tasks such as image classification and object detection, which assign independent labels or tags to an image, image captioning aims to generate a meaningful natural-language description of an image. This requires the model not only to recognize the objects in the image but also other visual elements such as activities and attributes, to understand the relationships among these objects, and then to express this semantic information in a readable natural-language sentence. Image captioning is therefore a very challenging task. Most traditional approaches fall into two broad categories, template-based and retrieval-based methods, but both rely heavily on complex upstream visual processing, and the back-end language model used for sentence generation is not well optimized, making it difficult to generate high-quality sentences. Recently, the deep neural network based encoder-decoder framework has been widely applied to image captioning and has made great progress. This thesis studies deep learning based image captioning; the main work is as follows.

1. We propose an adaptive attention based image captioning approach. Encoder-decoder models with a visual attention mechanism have become the most popular solutions to the image captioning task; however, their sentence decoders are usually too simple, and a single-LSTM decoder struggles to generate rich descriptions. To address this problem, and inspired by recent work on adaptive attention, we propose a new captioning model that uses a 101-layer residual network (ResNet-101) to encode the input image and a stacked two-layer Long Short-Term Memory (LSTM) network as the decoder to generate the output caption. We evaluate the proposed approach on the well-known Microsoft COCO (MSCOCO) caption dataset and show that it outperforms the adaptive attention model and achieves results superior to most state-of-the-art models.
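The abstract does not give implementation details, so the following is only a minimal sketch of the encoder-decoder structure described above, assuming a PyTorch/torchvision implementation. The layer sizes, the plain additive attention (the adaptive visual-sentinel gate is omitted for brevity), and all class and parameter names are illustrative assumptions, not taken from the thesis.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """ResNet-101 with its classification head removed; outputs spatial features."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=None)  # pretrained ImageNet weights would be loaded in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

    def forward(self, images):                    # images: (B, 3, 224, 224)
        feats = self.backbone(images)             # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)   # (B, 49, 2048): 49 spatial regions


class Decoder(nn.Module):
    """Stacked two-layer LSTM decoder with additive attention over image regions."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # Additive attention: score each of the 49 regions against the hidden state.
        scores = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)      # (B, 49, 1) attention weights
        return (alpha * feats).sum(dim=1)         # (B, 2048) attended visual context

    def forward(self, feats, captions):           # feats: (B, 49, 2048), captions: (B, T) token ids
        B, T = captions.shape
        h = torch.zeros(2, B, self.hidden_dim, device=feats.device)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            context = self.attend(feats, h[-1])   # attend using the top LSTM layer's hidden state
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1).unsqueeze(1)
            out, (h, c) = self.lstm(x, (h, c))
            logits.append(self.fc(out.squeeze(1)))  # word logits at step t
        return torch.stack(logits, dim=1)         # (B, T, vocab_size)
```

During training the decoder is fed the ground-truth caption tokens (teacher forcing); at test time it consumes its own previous predictions, which is exactly the mismatch addressed in the second contribution below.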
2. We optimize the model with a reinforcement learning approach. Most existing image captioning approaches are trained by maximizing the likelihood of each correct word, which leads to two major problems. First, there is an exposure bias between training and testing: the sentence decoder is trained to predict each word given the previous ground-truth words, whereas at test time captions are generated by greedy search or beam search, so the next word is predicted from previously generated words, a setting different from training. As a result, once a word is predicted inaccurately at test time, the prediction of all subsequent words is affected and the sentence degrades as errors accumulate. Second, there is a loss-evaluation mismatch: language models are usually trained to minimize the cross-entropy loss, while at test time they are evaluated with sentence-level metrics such as BLEU, METEOR, ROUGE-L, and CIDEr, which are non-differentiable and cannot be used directly as a training loss. We therefore further optimize our model with a reinforcement learning (RL) approach that directly optimizes the CIDEr metric, addressing both the exposure bias and the loss-evaluation mismatch. Evaluating the RL-trained model on the MSCOCO caption dataset shows significant gains in performance.
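The RL fine-tuning step can be illustrated with a similarly hedged sketch. The abstract states only that the CIDEr metric is optimized directly; the REINFORCE-style objective with a greedy-decoding baseline shown below, and the cider_score and caption-sampling helpers it assumes, are illustrative choices rather than the thesis's exact algorithm.

```python
import torch


def rl_caption_loss(sample_log_probs, sampled_captions, greedy_captions, references, cider_score):
    """REINFORCE-style loss that rewards sampled captions by their CIDEr score.

    sample_log_probs : (B,) summed log-probabilities of the sampled captions
    sampled_captions : list[str], captions drawn by sampling from the decoder
    greedy_captions  : list[str], captions from greedy decoding, used as a baseline
    references       : list[list[str]], ground-truth captions for each image
    cider_score      : callable (captions, references) -> (B,) tensor of CIDEr scores
    """
    with torch.no_grad():
        reward = cider_score(sampled_captions, references)   # sentence-level CIDEr reward
        baseline = cider_score(greedy_captions, references)  # baseline reduces gradient variance
        advantage = reward - baseline
    # Increase the probability of sampled captions that beat the greedy baseline.
    return -(advantage * sample_log_probs).mean()
```

Because the reward is computed on complete sampled sentences rather than per-word likelihoods, an objective of this form sidesteps both exposure bias and the non-differentiability of the sentence-level evaluation metrics.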
Keywords/Search Tags:deep learning, image caption, attention mechanism, reinforcement learning