
Image Caption Model Based On Deep Reinforcement Learning

Posted on: 2021-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: S C Liu
Full Text: PDF
GTID: 2518306311970879
Subject: Circuits and Systems
Abstract/Summary:
Computer vision and natural language processing are two popular directions of current AI research, and image captioning is a cross-disciplinary, comprehensive research topic at their intersection that has received increasing attention from academia and industry in recent years. The machine must understand the image content with a model, grasp the semantic information of the image, and learn to express that content in natural language, finally generating smooth, coherent, and readable sentences with correct semantic logic. Image captioning based on deep learning currently faces two problems: first, the model is trained with a cross-entropy loss, which is inconsistent with the optimization direction of BLEU-4 and other evaluation metrics; second, there is an exposure bias between the training and testing stages, that is, the model conditions on ground-truth words during training but on its own predictions during testing. To solve these two problems, this thesis focuses on a Transformer-based model and a reinforcement learning algorithm with a self-evaluation mechanism. The main work is as follows:

1. We propose an image caption generation model named "Caption Transformer", based on the classical Transformer model originally designed for machine translation. The encoder-decoder framework has become the mainstream framework for image caption generation, and the attention mechanism has greatly improved captioning quality. In most related work, a convolutional neural network is used as the encoder to extract image features, and a recurrent neural network is used as the decoder to generate the caption. However, a plain CNN is limited in feature extraction and is not capable of multi-view learning, while an RNN generates words sequentially, which limits the model's parallel computing capability and suffers from long-term dependency problems. To address these problems, we apply the classical Transformer model to image captioning, and we place multiple object detectors before the encoder to extract image features and improve the model's multi-view learning capability; the detected image features are input to the decoder. Finally, we conduct experiments on the MSCOCO dataset, and the results show that our model performs better than the Faster R-CNN + LSTM benchmark model.

2. We propose an image caption generation model based on a reinforcement learning algorithm with self-evaluation. To solve exposure bias between the training and testing stages and the inconsistency between the optimization objective and the evaluation metrics, we introduce reinforcement training, drawing on the actor-critic (AC) and Q-learning algorithms, and propose a self-evaluation-based reinforcement learning algorithm on top of Caption Transformer. Most current image caption generation models are trained by maximum likelihood estimation, i.e., by minimizing a cross-entropy loss. At the training stage the decoder predicts each word given the previous ground-truth words, but at the testing stage it must rely on its own previously generated words to predict the next one. As a result, once a word is predicted inaccurately at test time, the error accumulates and propagates, affecting the prediction of all subsequent words; this is the exposure bias problem. Moreover, the model minimizes the cross-entropy loss during training, while at test time we usually use sentence-level evaluation metrics such as BLEU to evaluate performance, so there is an inconsistency between the training objective and the evaluation metrics. To solve these problems, we use a self-evaluation reinforcement learning training approach to train the model directly on BLEU and other metrics. Comparative experiments show that this approach further improves the model.
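The core operation that lets the Transformer decoder attend over detector-extracted region features is scaled dot-product attention. The following is a minimal NumPy sketch, not the thesis's implementation: the shapes (5 region features of dimension 8, 3 decoder queries) are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V, weights

# Toy example: 5 "region features" (e.g. from an object detector) of
# dimension 8, attended by 3 decoder queries. Shapes are illustrative only.
rng = np.random.default_rng(0)
K = V = rng.standard_normal((5, 8))
Q = rng.standard_normal((3, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the region features, so the decoder can softly select which detected regions inform each generated word.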
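The exposure bias described above can be seen in a toy decoder loop. This sketch (hypothetical words and next-word table, not from the thesis) contrasts teacher forcing at training time with free-running decoding at test time: with teacher forcing a single model error stays isolated, without it the error changes the conditioning context and cascades.

```python
def decode(model_next_word, ground_truth, teacher_forcing):
    """Toy decoder loop contrasting training (teacher forcing) with testing."""
    out = []
    prev = "<bos>"
    for t in range(len(ground_truth)):
        word = model_next_word(prev)
        out.append(word)
        # Training: condition on the ground-truth word (source of exposure bias);
        # testing: condition on the model's own (possibly wrong) prediction.
        prev = ground_truth[t] if teacher_forcing else word
    return out

gt = ["a", "dog", "runs", "fast"]
# Toy next-word model that makes one mistake: it predicts "sits" after "dog".
table = {"<bos>": "a", "a": "dog", "dog": "sits", "sits": "down", "runs": "fast"}
with_tf = decode(table.get, gt, teacher_forcing=True)      # one wrong word
without_tf = decode(table.get, gt, teacher_forcing=False)  # error cascades
```

Here `with_tf` recovers after the mistake because the ground-truth word "runs" is fed back in, while `without_tf` continues from the wrong word "sits" and derails the rest of the caption.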
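The self-evaluation training idea of work 2 can be sketched as REINFORCE with the greedily decoded caption's reward as a baseline (often called self-critical sequence training). This is a minimal pure-Python sketch under assumptions: the thesis rewards with BLEU and other metrics, which a toy unigram precision stands in for here, and all captions and probabilities are illustrative.

```python
import math

def unigram_precision(candidate, reference):
    """Toy stand-in for BLEU: fraction of candidate tokens found in the reference."""
    if not candidate:
        return 0.0
    ref = set(reference)
    return sum(t in ref for t in candidate) / len(candidate)

def self_critical_loss(sampled, greedy, reference, log_probs):
    """REINFORCE with the greedy caption's reward as baseline:
    loss = -(r_sample - r_greedy) * sum_t log p(w_t)."""
    r_sample = unigram_precision(sampled, reference)
    r_greedy = unigram_precision(greedy, reference)
    return -(r_sample - r_greedy) * sum(log_probs)

reference = "a dog runs on the grass".split()
greedy = "a cat on grass".split()               # baseline caption, reward 0.75
sampled = "a dog runs on grass".split()         # sampled caption, reward 1.0
log_probs = [math.log(0.5)] * len(sampled)      # toy per-word log-probabilities

loss = self_critical_loss(sampled, greedy, reference, log_probs)
```

Because the sampled caption scores above the greedy baseline, the advantage is positive and minimizing this loss raises the log-probabilities of the sampled words; captions worse than the baseline are pushed down. This optimizes the evaluation metric directly and removes the train/test mismatch, since both captions are generated by the model itself.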
Keywords/Search Tags: Deep Learning, Image Caption, Transformer, Reinforcement Learning