Font Size: a A A

Research On Key Techniques Of Multimodal Image Caption

Posted on:2021-04-16Degree:MasterType:Thesis
Country:ChinaCandidate:X MengFull Text:PDF
GTID:2428330647450747Subject:Computer technology
Abstract/Summary:PDF Full Text Request
Image semantic understanding is one of the popular research directions in the field of computer vision,mainly including image classification,object detection,semantic segmentation,and multimodal image understanding.The purpose of multimodal image understanding,i.e.image captioning,is to generate a fluent sentence to describe the rich and comprehensive image content of objects,relationships,and events.In recent years,the rapid development of neural networks has brought new ideas to the research of multimodal image understanding.This paper focuses on how to express the complete semantic information of the image.Therefore,multimodal image understanding is taken as the research content in this paper,which brings novel feasible ideas and methods to image semantic understanding.The existing multimodal image understanding models usually adopt an encoderdecoder framework.The widely used autoregressive decoder generates sentences of better fluency but suffers from problems,such as slow sequential decoding and inaccurate semantics.Although the non-autoregressive decoder adopts the faster parallel decoding,the quality of the generated description sentences is poor.In view of the shortcomings of the two decoders,this paper proposes a masked non-autoregressive decoder.Moreover,all the existing decoders use cross-entropy as the loss function,which has the problem of equally treating data of different quality in training.In response to this problem,this paper proposes reinforced cross-entropy loss function and stochastic deprecation.The specific work is as follows:1.To improve the decoding speed and quality,this paper proposes a masked non-autoregressive decoder.This paper first selects several kinds of masking ratios.In the training process,given each pair of images and their ground truth sentences,a masking ratio is randomly selected to mask several positions in the ground truth sentence.The training goal is to predict the complete sentence.Since the adopted decoder draws on the network framework of the non-autoregressive decoder,the advantages of non-autoregressive parallel decoding are retained,and the masked language model also incorporates the conditional probability of the target language distribution to obtain the advantage of autoregressive decoding.In the prediction process,this paper generates image descriptions in parallel in several fixed stages from completely masked token sequences to completely unmasked token sequences.The experiments in MSCOCO benchmark show that the decoding speed of masked non-autoregressive decoder is faster,which is 2.8 times and 1.66 times that of the same configuration of the autoregressive decoder in the 4-stage and 7-stage decoding respectively.Moreover,the quality of the generated description sentences is higher,and the semantic content is retained more accurately and effectively,since the proposed masked non-autoregressive decoder reaches 21.1 in terms of SPICE metric that is more in line with human evaluation standards,exceeding the autoregressive decoder 0.9 and exceeding the non-autoregressive decoder 4.4.2.To treat data of different quality differently during training and at the same time alleviate the problem of inconsistency between the cross-entropy loss function(CEL)and evaluation metrics,this paper proposes reinforced cross-entropy loss(RCEL)and stochastic deprecation(SD).In the reinforced cross-entropy loss function,this paper first uses the selected evaluation metric to calculate the quality score of each truth sentence and then multiplies it by the log probability of each word of the truth sentence to obtain the loss function.The combination of the quality score of the sentence and the loss function not only treats the data of different quality differently but also indirectly optimizes the evaluation metric to alleviate the inconsistency between the loss function and the evaluation metric.In the stochastic deprecation module,without loss of corpus diversity,it randomly selects high-quality ground truth sentences and discards noise.Reinforced cross-entropy loss and stochastic deprecation are universal and can be combined into RCEL-SD.The experimental results on MSCOCO benchmark show that the RCEL-SD proposed in this paper is superior to CEL in terms of the whole 7 evaluation metrics in the three state-of-the-art multi-modal image understanding modelsand the average improvement on all models is BLEU-1 0.74,BLEU-2 0.90,BLEU-30.95,BLEU-4 0.85,METEOR 0.44,ROUGE 0.52,CIDEr 4.38 and SPICE 0.57.
Keywords/Search Tags:Image scene understanding, Encoder-decoder, Masked Non-autoregressive Decoder, Reinforced Cross Entropy Loss, Stochastic Deprecation
PDF Full Text Request
Related items