Research On Image Paragraph Captioning Method Based On Deep Learning

Posted on: 2021-12-10
Degree: Master
Type: Thesis
Country: China
Candidate: X H He
GTID: 2518306476452684
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of science and technology and the arrival of the information age, image paragraph captioning has far-reaching research significance and broad application value in cross-modal content retrieval, human-computer interaction, robot navigation, and industrial applications such as e-commerce and children's education. Standard image captioning generates a single sentence per image, so it tends to miss details and to reflect subjective bias because one sentence can cover only so much. Dense captioning generates phrase-level captions, but it fails to capture relationships between objects, and because the independent phrases are only weakly correlated, its output is poorly suited to interaction with humans. This thesis therefore focuses on paragraph-level image captioning, aiming to solve these problems and to generate detailed, naturally coherent paragraph-level descriptions.

To address the poor diversity and coherence of paragraphs generated by the current state-of-the-art image paragraph captioning model, this thesis proposes an improved two-stage training strategy. In the first stage, the model is trained at the word level with a cross-entropy loss, with the goal of producing accurate words. To provide a stronger baseline model for the second stage, this thesis proposes a penalty strategy for potentially duplicated n-grams. Experiments show that this strategy effectively reduces the probability of generating fully repeated, redundant sentences and improves the diversity of the paragraphs.

The first stage still suffers from the loss-metric mismatch problem and the exposure bias problem, and it does not fundamentally resolve the diversity problem. In the second stage, this thesis therefore introduces and improves Self-critical Sequence Training, a method based on Reinforcement Learning. It proposes novel methods for modeling the diversity and coherence of text, the properties that humans attend to most when evaluating long text. Word-level weights and n-gram-level distributions from human-written paragraphs are introduced to bring the proposed modeling methods closer to human consensus. By combining the diversity and coherence rewards with the automatic evaluation metric CIDEr, the model is guided directly toward producing diverse and coherent paragraph captions. Finally, a series of experiments is designed to verify the effectiveness of the proposed method. The results show that the proposed two-stage model outperforms the current state-of-the-art model on four of the six standard evaluation metrics.
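The two training-stage ideas summarized above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract gives no formulas, so the function names, the penalty form (a fixed log-probability deduction per earlier occurrence of the n-gram), and the reward weights `w_div` and `w_coh` are all assumptions chosen for illustration. The first function lowers the score of any next word that would complete an n-gram already present in the generated text. The other two show how a combined reward could enter the usual self-critical loss, in which the reward of the greedy decode serves as the baseline.

```python
from typing import Dict, Sequence


def penalize_repeated_ngrams(
    logprobs: Dict[str, float],
    history: Sequence[str],
    n: int = 3,
    penalty: float = 2.0,
) -> Dict[str, float]:
    """Lower the score of next words that would repeat an n-gram in `history`.

    Hypothetical penalty form: subtract `penalty` once for each time the
    candidate word has already followed the current (n-1)-word prefix.
    """
    # The last n-1 generated words form the prefix of the next n-gram.
    prefix = tuple(history[len(history) - (n - 1):])
    counts: Dict[str, int] = {}
    for i in range(len(history) - n + 1):
        if tuple(history[i:i + n - 1]) == prefix:
            word = history[i + n - 1]
            counts[word] = counts.get(word, 0) + 1
    return {w: lp - penalty * counts.get(w, 0) for w, lp in logprobs.items()}


def combined_reward(
    cider: float,
    diversity: float,
    coherence: float,
    w_div: float = 0.5,
    w_coh: float = 0.5,
) -> float:
    """Weighted sum of CIDEr and the two extra rewards (weights are assumed)."""
    return cider + w_div * diversity + w_coh * coherence


def scst_loss(
    sample_logprobs: Sequence[float],
    sample_reward: float,
    greedy_reward: float,
) -> float:
    """Self-critical loss for one sampled paragraph.

    The greedy decode's reward is the baseline, so samples that beat the
    greedy output have their log-probability pushed up, and vice versa.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


# Usage: "a man is" was already generated, so "is" is penalized after "a man".
history = ["a", "man", "is", "a", "man"]
scores = penalize_repeated_ngrams({"is": -0.5, "walks": -1.0}, history)
reward = combined_reward(cider=1.0, diversity=0.5, coherence=0.25)
loss = scst_loss([-0.5, -1.0], sample_reward=reward, greedy_reward=1.0)
```

In a real model the scores would be tensors over the whole vocabulary and the loss would be averaged over a batch. Plain dictionaries and floats are used here only to keep the sketch self-contained.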
Keywords/Search Tags: Deep Learning, Image Captioning, Faster R-CNN, Reinforcement Learning