
Image Caption Generation Based On Generative Adversarial Networks

Posted on: 2019-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: F W Wang
Full Text: PDF
GTID: 2428330590467388
Subject: Computer Science and Technology
Abstract/Summary:
The task of image caption generation is to automatically generate a descriptive sentence for a given image. This task has drawn considerable attention from the artificial intelligence research community in recent years, because it shows great potential in many application scenarios and involves two major fields of artificial intelligence: computer vision and natural language processing. Most recent captioning research builds on the encoder-decoder structure with a maximum likelihood estimation (MLE) training objective, which maximizes the probability that the model produces the ground-truth captions. However, MLE has two defects. First, MLE ignores the diversity of language; for example, two sentences with different words, phrases, and structures can express the same meaning. Second, MLE suffers from the exposure bias problem, i.e., the divergence between the training phase, when inputs come from ground-truth captions, and the inference phase, when inputs come from previously sampled words.

We therefore design an image captioning model based on a generative adversarial network (GAN), which overcomes these two defects of MLE. A GAN is composed of a generator network and a discriminator network: the generator tries to generate captions good enough to fool the discriminator, while the discriminator tries to distinguish generated captions from natural ones. The two networks are trained in turn until they converge. We design a generator that uses the encoder-decoder structure, in which we propose a "time-dependent pre-attention" (TDPA) mechanism to help the decoder better understand the relations within images. TDPA lets every image feature attend to the other image features and composes an aggregated feature that contains relation information; the aggregated features are then used for the decoder's subsequent attention. We also design a discriminator that uses a recurrent neural network to encode the input sentences and the reference sentences, and performs semantic matching among the encoded inputs, the encoded references, and the image features.

The generator cannot be trained directly, because its outputs are discrete words through which gradients cannot propagate back. We therefore propose a training algorithm based on reinforcement learning. In the reinforcement learning framework, the generator is treated as an agent and the output of the discriminator is treated as a reward from the environment; we then use the policy gradient algorithm to estimate the generator's gradient, with a self-critical baseline to reduce the variance of the estimated gradient.

We conduct experiments on the public captioning dataset Microsoft COCO. The results show that the TDPA mechanism improves model performance under multiple automatic evaluation metrics and that adversarial training effectively improves the quality of the generated sentences.
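The abstract describes TDPA as letting each image feature attend to all other image features and composing relation-aware aggregated features for the decoder's later attention. Below is a minimal PyTorch sketch of one plausible reading of that idea; the thesis does not give the exact formulation, so the module name, the conditioning of the query on the decoder hidden state (the "time-dependent" part), and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDependentPreAttention(nn.Module):
    """Hypothetical sketch of TDPA: every image feature attends to all image
    features, conditioned on the decoder state at each time step, yielding
    relation-aware aggregated features for the decoder's own attention."""

    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.q = nn.Linear(feat_dim + hid_dim, feat_dim)  # query: feature + decoder state
        self.k = nn.Linear(feat_dim, feat_dim)
        self.v = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats, h_t):
        # feats: (B, N, feat_dim) regional image features
        # h_t:   (B, hid_dim)     decoder hidden state at time step t
        B, N, _ = feats.shape
        h = h_t.unsqueeze(1).expand(B, N, -1)          # broadcast state to every region
        q = self.q(torch.cat([feats, h], dim=-1))      # time-dependent queries
        k, v = self.k(feats), self.v(feats)
        scores = torch.bmm(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)               # each region attends to all regions
        return torch.bmm(attn, v)                      # (B, N, feat_dim) aggregated features
```

The decoder would then apply its usual attention over these aggregated features rather than over the raw regional features.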
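The discriminator is described as an RNN sentence encoder combined with a semantic match among the encoded candidate, the encoded reference, and the image features. The following sketch shows one way such a matching score could be computed; the architecture details (shared GRU encoder, mean-pooled image features, cosine similarity) are assumptions, not the thesis's stated design.

```python
class MatchDiscriminator(nn.Module):
    """Hypothetical sketch: encode candidate and reference captions with a
    shared GRU, pool the image features, and score the candidate by its
    similarity to both the image and the reference sentence."""

    def __init__(self, vocab_size, emb_dim, hid_dim, feat_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.img_proj = nn.Linear(feat_dim, hid_dim)

    def encode(self, tokens):                 # tokens: (B, T) word ids
        _, h = self.rnn(self.emb(tokens))     # final hidden state as sentence code
        return h.squeeze(0)                   # (B, hid_dim)

    def forward(self, cand, ref, feats):
        c, r = self.encode(cand), self.encode(ref)
        v = self.img_proj(feats.mean(dim=1))  # mean-pooled image representation
        score = F.cosine_similarity(c, v) + F.cosine_similarity(c, r)
        return torch.sigmoid(score)           # probability the caption is "natural"
```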
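Because the sampled words are discrete, the generator is trained with policy gradients: the discriminator's output serves as the reward, and a self-critical baseline (the reward of the greedy decode) reduces the variance of the gradient estimate. A schematic of one such update follows, where `generator.sample`, `generator.greedy`, and the discriminator call `D(...)` are assumed interfaces for illustration only.

```python
def self_critical_step(generator, D, feats, refs, optimizer):
    """One policy-gradient update with a self-critical baseline (sketch).
    `generator.sample` returns sampled token ids plus per-sentence
    log-probabilities; `generator.greedy` returns the greedy decode."""
    sampled, log_probs = generator.sample(feats)   # stochastic rollout
    with torch.no_grad():
        greedy = generator.greedy(feats)           # baseline rollout
        r_sample = D(sampled, refs, feats)         # reward from discriminator
        r_greedy = D(greedy, refs, feats)          # self-critical baseline
    advantage = r_sample - r_greedy                # positive => reinforce sample
    loss = -(advantage * log_probs).mean()         # REINFORCE estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the alternating GAN schedule the abstract describes, a step like this for the generator would be interleaved with ordinary supervised updates of the discriminator on natural versus generated captions.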
Keywords/Search Tags:image captioning, GAN, reinforcement learning, attention mechanism