
Image-text Translation Based On Cross-modal Related Semantics And Attention Mechanism

Posted on: 2021-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: M Tian
Full Text: PDF
GTID: 2518306047485924
Subject: Master of Engineering
Abstract/Summary:
With the development of deep learning, researchers pay increasing attention to the intersection of image and text. Image captioning and text-based image generation are the two main research directions in this area. Image captioning automatically generates a textual description of an image's content; text-based image generation, given a textual description, produces an image that matches it. Because the two modalities are asymmetric, the two tasks face different challenges. For image captioning, the difficulties are how to encode image information more accurately and how to make the decoder produce more natural, fluent sentences. For image generation, research focuses on improving the generator's performance and increasing the stability of model training. This thesis therefore explores the following:

(1) Since the semantic attributes of an image reflect its visual content to a certain extent with low information redundancy, we use attributes as the encoded information for image captioning. However, common attribute features contain noise, so we use cross-modal retrieval to find the salient words of an image and construct a salient-word vector that linearly re-weights the attribute features, reducing the impact of noisy words. On the decoding side, we find that beam search under the common log-likelihood objective cannot always find the best description sentence, so we propose two sentence re-ranking methods based on cross-modal related semantics: re-ranking based on visual and textual features, and re-ranking based on pseudo reference sentences. Selecting candidates under these re-ranking methods yields sentences better suited to describing the image.

(2) Most current image-generation work focuses on the joint attention mechanism between text and image, that is, how to better align text and image in the semantic space, but ignores the attention mechanism within the image features themselves. We therefore introduce a mixed attention model into the generator, attending to both the feature map and the channels simultaneously, so that the generator produces more reasonable images. For model training, because the traditional loss function is prone to instability, we add a square loss to the traditional generative adversarial loss, so that the generator receives more discriminative information and generation performance improves further. In addition, we apply spectral normalization to the discriminator, increasing training stability by constraining the gradients of its parameters. Finally, qualitative and quantitative experiments on the CUB and Oxford datasets, along with ablation studies of each module, demonstrate the effectiveness of the proposed methods.
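The sentence re-ranking idea described in (1) — preferring beam-search candidates whose semantics match the image rather than trusting log-likelihood alone — can be sketched as a score that mixes the decoder's log-likelihood with a cross-modal similarity. The function name, the cosine-similarity choice, and the mixing weight `alpha` are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def rerank_candidates(log_probs, image_vec, sentence_vecs, alpha=0.5):
    """Pick the beam-search candidate with the best mixed score:
    alpha * decoder log-likelihood + (1 - alpha) * cosine similarity
    between the image embedding and the candidate-sentence embedding.
    Returns the index of the winning candidate."""
    img = image_vec / np.linalg.norm(image_vec)
    sims = np.array([s @ img / np.linalg.norm(s) for s in sentence_vecs])
    scores = alpha * np.asarray(log_probs) + (1 - alpha) * sims
    return int(np.argmax(scores))
```

Under this score, a candidate with a slightly lower log-likelihood can still win if its embedding aligns better with the image, which is exactly the failure mode of plain log-likelihood beam search that re-ranking targets.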
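The "mixed attention" in (2) — attending to the feature map and the channels at once — can be illustrated with a minimal squeeze-and-gate sketch in numpy. This is a generic channel-then-spatial gating scheme assumed for illustration; the thesis's actual generator module will differ:

```python
import numpy as np

def mixed_attention(feat):
    """Gate a feature map of shape (C, H, W) along both axes:
    channel attention re-weights each channel by a gate computed from
    its spatial average, then spatial attention re-weights each
    position by a gate computed from its channel average."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Channel attention: squeeze spatial dims, gate each channel.
    chan_gate = sigmoid(feat.mean(axis=(1, 2)))        # shape (C,)
    feat = feat * chan_gate[:, None, None]
    # Spatial attention: squeeze channels, gate each position.
    spat_gate = sigmoid(feat.mean(axis=0))             # shape (H, W)
    return feat * spat_gate[None, :, :]
```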
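The two training-stability ideas in (2) can also be sketched briefly: a least-squares ("square") adversarial loss, and spectral normalization of a weight matrix via power iteration. The 1/0 targets follow the common least-squares GAN convention and may differ from the thesis's exact choice; `n_iter` is an illustrative default:

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """Least-squares adversarial losses: the discriminator pushes its
    outputs on real samples toward 1 and on fakes toward 0; the
    generator pushes the discriminator's outputs on fakes toward 1.
    The quadratic penalty gives the generator gradients even for
    confidently rejected samples."""
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

def spectral_normalize(W, n_iter=20):
    """Divide a weight matrix by its largest singular value, estimated
    by power iteration, so the corresponding linear layer has spectral
    norm 1 (constraining its Lipschitz constant)."""
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v
    return W / sigma
```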
Keywords/Search Tags:Image captioning, cross-modal retrieval, sentence re-ranking, image generation, attention mechanism