
Research On Text Description Image Generation Based On Generative Adversarial Network

Posted on: 2022-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: T Hu
Full Text: PDF
GTID: 2518306323467024
Subject: Data Science
Abstract/Summary:
Generating images from text descriptions lies at the intersection of natural language processing and computer vision. The task takes a text description as input and generates an image whose semantic details match it. Because natural language and images are two different modalities, a single language description can correspond to images with many different pixel configurations, so generating high-resolution images that match the semantics of the text is challenging. In recent years, text-to-image generation models have mostly used a stack of multiple generative adversarial networks as their architecture: each stage generates images at a different resolution, and the model outputs a high-resolution image at the final stage. However, training this architecture is very unstable and such models do not converge easily; the generated image often looks like a collage of text attributes and lacks authenticity. In particular, when complex images must be generated from text, the discriminator cannot provide sufficient supervision. This thesis studies the problems of the multi-stage stacked generative architecture and proposes a new generative adversarial network framework for the text-to-image generation task. The main research work is summarized as follows:

(1) To address the long training time and poor convergence of current stacked text-to-image models, a model that generates high-quality images with a single-stage generative adversarial network is proposed. We design the network with a residual structure and stabilize training with the hinge loss. To address the problems that the generated image does not fully integrate the text information and lacks authenticity, we design a new attention module that introduces channel attention and pixel attention mechanisms to focus on the visual feature maps most relevant to the text description. Experimental results on the CUB-200-2011 bird dataset show that the proposed model achieves good results in image authenticity and convergence speed.

(2) To address the fact that the discriminator cannot provide sufficient supervision when generating complex images, we propose a cross-modal projection mechanism built on the single-stage generative adversarial network. It captures the semantic consistency of text and images, providing fine-grained discrimination information for the generator. We incorporate local language representations into the discriminator: by mapping the last two downsampling layers of the discriminator to the local and global representations of the text, the discriminator can provide effective supervision signals to the generator, which then produces images of high visual quality with reasonable layout. Experimental results on the MS-COCO complex-scene dataset show that the proposed model yields better visual quality than current multi-stage generation models.
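The hinge loss used to stabilize training in contribution (1) has a standard form. As a minimal sketch (function names and the use of plain NumPy are illustrative, not the thesis's actual implementation): the discriminator is pushed to score real images above +1 and generated images below -1, while the generator simply raises the discriminator's score on its outputs.

```python
import numpy as np

def d_hinge_loss(real_scores, fake_scores):
    # Discriminator hinge loss: penalize real scores below +1
    # and fake scores above -1.
    return np.mean(np.maximum(0.0, 1.0 - real_scores)) + \
           np.mean(np.maximum(0.0, 1.0 + fake_scores))

def g_hinge_loss(fake_scores):
    # Generator hinge loss: raise the discriminator's score
    # on generated images.
    return -np.mean(fake_scores)
```

Because the discriminator loss saturates once scores clear the margin, gradients stop growing for already-confident examples, which is one reason the hinge loss tends to stabilize adversarial training.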
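The channel and pixel attention mechanisms of contribution (1) can be sketched as text-conditioned gating of a visual feature map. The sketch below is an assumption about the general shape of such a module (the weight matrices `Wc`, `Wp` and the sigmoid gating are hypothetical stand-ins for the thesis's learned attention layers): channel attention scales each feature channel by a text-derived gate, and pixel attention scales each spatial location.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def text_conditioned_attention(feat, text, Wc, Wp):
    # feat: (C, H, W) visual feature map; text: (T,) sentence embedding.
    # Wc: (C, T) channel-attention weights; Wp: (H*W, T) pixel-attention weights.
    C, H, W = feat.shape
    # Channel attention: one gate per feature channel, conditioned on the text,
    # emphasizing the channels most relevant to the description.
    ch_gate = sigmoid(Wc @ text)                   # (C,)
    feat = feat * ch_gate[:, None, None]
    # Pixel attention: one gate per spatial location, conditioned on the text,
    # emphasizing the regions most relevant to the description.
    px_gate = sigmoid(Wp @ text).reshape(H, W)     # (H, W)
    return feat * px_gate[None, :, :]
```

Since both gates lie in (0, 1), the module can only attenuate features, never amplify them, which keeps the conditioning well behaved.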
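The cross-modal projection of contribution (2) can be illustrated as scoring discriminator features against text embeddings. This is a hedged sketch, not the thesis's exact formulation: the function names and the additive combination of an unconditional realism score with sentence-level (global) and word-level (local) projections are assumptions made for illustration.

```python
import numpy as np

def projection_score(img_feat, txt_feat):
    # Project the image feature vector onto the unit-normalized text
    # embedding; a large projection means the image agrees with the text.
    t = txt_feat / np.linalg.norm(txt_feat)
    return float(img_feat @ t)

def discriminator_output(adv_score, global_feat, sent_emb, local_feat, word_embs):
    # Combine an unconditional realism score with a global sentence
    # projection and the best word-level projection, so the generator
    # receives fine-grained text-matching feedback.
    word_scores = [projection_score(local_feat, w) for w in word_embs]
    return adv_score + projection_score(global_feat, sent_emb) + max(word_scores)
```

Mapping the last two downsampling layers to local and global text representations, as the abstract describes, would supply `local_feat` and `global_feat` in a scheme like this one.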
Keywords/Search Tags:Text-to-Image Generation, Generative Adversarial Networks, Channel Attention, Pixel Attention, Cross-Modal Projection