
Research On Image Captioning Algorithms Based On Deep Learning

Posted on: 2021-05-22  Degree: Master  Type: Thesis
Country: China  Candidate: J C Hu  Full Text: PDF
GTID: 2428330614960194  Subject: Electronic and communication engineering
Abstract/Summary:
The image captioning task aims to give computers the ability to "talk about pictures": given an input image, the model automatically generates a text sequence that conforms to natural-language rules and faithfully reflects the image content. The task usually employs an image recognition model or an object detection model as a feature extractor or entity detector, and the extracted image features are then consumed by the captioning model. However, existing image captioning algorithms cannot make good use of the output of these upstream tasks. This is often because the attention mechanism, introduced to solve the long-distance dependency problem in sequence-to-sequence generation, causes an "over-attention" problem: the model ignores non-salient content in the image, so the generated sentence misses some image details.

In addition, optimizing model parameters by minimizing the cross-entropy objective introduces exposure bias and label bias. Exposure bias means that the model always conditions on words from the reference sentence during training but on its own previously generated words at test time, which leads to error accumulation. Label bias means that the model tends to reproduce the high-frequency scenes and high-frequency words of the reference sentences seen during training. The cross-entropy loss also leads to a lack of diversity and to over-correction in the generated captions. Although exposure bias and label bias can be partly alleviated by introducing reinforcement learning into the image captioning task, such algorithms usually use automatic evaluation metrics (such as BLEU, METEOR, CIDEr, and ROUGE) as the reward. Because these metrics do not correlate perfectly with the judgments of human experts, the model merely inflates the metric scores without actually improving the
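The exposure-bias phenomenon described above can be illustrated with a deliberately simplified sketch. The bigram "model" below is a hypothetical toy (a lookup table with one learned error), not the thesis's captioning network; it only shows how conditioning on ground-truth prefixes during training differs from conditioning on the model's own outputs at test time, where a single error compounds.

```python
# Toy illustration of exposure bias. The next-token table stands in for a
# trained decoder and contains a single error: after "b" it predicts "x".
reference = ["a", "b", "c", "d"]
next_token = {"<s>": "a", "a": "b", "b": "x", "x": "x", "c": "d"}

def teacher_forcing(ref):
    # Training phase: each step conditions on the ground-truth previous token,
    # so one wrong prediction stays an isolated error.
    return [next_token[prev] for prev in ["<s>"] + ref[:-1]]

def free_running(steps):
    # Test phase: each step conditions on the model's own previous prediction,
    # so a single error propagates through the rest of the sequence.
    out, prev = [], "<s>"
    for _ in range(steps):
        prev = next_token[prev]
        out.append(prev)
    return out

print(teacher_forcing(reference))  # one isolated error vs. the reference
print(free_running(4))             # the error accumulates
```

Under teacher forcing only the third token is wrong, while free-running decoding never recovers after the first mistake; this gap between the two decoding regimes is exactly what reinforcement-learning-based training tries to close.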
quality of the generated captions.

In this paper, an image captioning framework that combines a hybrid attention mechanism with inverse reinforcement learning is proposed. The framework improves model performance through the following designs. (1) The hybrid attention mechanism is composed of a visual self-attention mechanism and a soft attention mechanism: the former focuses on the major objects in the image, while the latter represents the relationships among all detected objects. This design avoids the problem of the attention mechanism over-attending to a single major object. The outputs of the two attention mechanisms are concatenated as the input of the subsequent modules. (2) The reward for the model's self-learning is obtained from a mapping between image features and sentence features, whereas the reward derived from evaluation metrics is determined only by the n-gram matching degree of the sentence itself; the former better guarantees the correspondence between sentence and image. (3) In the training stage, the generated sentences and reference sentences are mapped to a Boltzmann distribution, and the generator network is then trained to alleviate exposure bias, label bias, and over-correction, and to increase sentence diversity. Finally, experimental results on the Microsoft COCO dataset show that the proposed algorithm has advantages over several current algorithms in both qualitative and quantitative terms.
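The hybrid attention design can be sketched as follows. This is a minimal single-head NumPy illustration under assumed shapes (5 detected objects, 8-dimensional features); the projections, pooling, and dimensions are assumptions for clarity, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, d = 5, 8                    # detected object features and their size
V = rng.standard_normal((n_objects, d))  # features from the object detector
h = rng.standard_normal(d)               # decoder hidden state at this step

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Visual self-attention: every detected object attends to every other object.
A = softmax(V @ V.T / np.sqrt(d), axis=-1)   # (n_objects, n_objects) weights
self_att = A @ V                             # (n_objects, d) attended features

# Soft attention: the decoder state weighs the detected objects.
alpha = softmax(V @ h / np.sqrt(d))          # (n_objects,) attention weights
soft_att = alpha @ V                         # (d,) weighted context vector

# Concatenate the two attention outputs as input to the downstream modules
# (mean-pooling the self-attention output is an illustrative choice).
context = np.concatenate([self_att.mean(axis=0), soft_att])  # (2*d,)
```

Because the two mechanisms are computed independently and only merged by concatenation, a failure of one view (e.g. over-attending to a single salient object) does not suppress the information carried by the other.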
Keywords/Search Tags:image captioning, deep learning, object detection, attention mechanism, inverse reinforcement learning, generative adversarial networks