
Generative Adversarial Network For Text-to-Image Synthesis

Posted on: 2021-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Chen
Full Text: PDF
GTID: 2428330623968547
Subject: Engineering
Abstract/Summary:
The text-to-image synthesis task aims to generate images that semantically match an input sentence describing details (e.g., color and shape) of an object. Because a single sentence can semantically match several images with different content, the task requires not only that the semantics of the generated image be consistent with the input text, but also that the content of the generated images be diverse. Existing text-to-image synthesis models use Generative Adversarial Networks (GANs) as the basic framework. However, because the theory of GANs is itself imperfect, training is often unstable. Meanwhile, to make the generated images sufficiently realistic and natural, their resolution must be large enough, which inevitably brings a large number of network parameters and computations. In this work, we propose the following three algorithms for these specific problems:

1) To address the unstable training process, we propose the Perceptual Pyramid Adversarial Network (PPAN). This network adopts a pyramid structure to enhance feature representations at all scales, and a perceptual loss to directly regularize the generated images and real-world images in feature space. Both modules are built on a basic hierarchical-nested structure. Experiments show that they not only make the training process more stable but also improve the quality of the generated images.

2) To address the large number of parameters and computations in the network, we propose the Lightweight Dynamic Conditional GAN with Pyramid Attention (LD-CGAN). This network greatly simplifies the structure without reducing the quality of the generated images. LD-CGAN introduces an information-compensation scheme: whereas previous methods take the semantic information as input only once, LD-CGAN first disentangles the input text features semantically in an unsupervised manner, and then uses the proposed Conditional Manipulating Module to continuously compensate the disentangled semantics into features at all scales. Compared with PPAN, the number of parameters and computations is reduced by up to 80% without reducing the quality of the generated images.

3) To address the low quality of generated images, we propose the Fine-grained Perceptual Pyramid Adversarial Network (FPAN). This network adopts a whole-to-parts training strategy: based on the initial high-quality image produced by the Whole Synthesizer, the Parts Synthesizer uses word features to enhance local regions of the generated image, and its discriminators introduce a word-by-word attention mechanism to improve semantic consistency. FPAN thus makes full use of word features to correct and refine the generated content. Consequently, the fidelity, vividness, and diversity of images generated by FPAN greatly exceed the results of state-of-the-art models.
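The perceptual loss used by PPAN can be illustrated with a minimal sketch. It measures the distance between generated and real images in the feature space of a frozen feature extractor rather than in pixel space. Here a fixed random projection stands in for the pretrained feature network; the matrix `W`, the feature map `features`, and all shapes are hypothetical placeholders, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "feature extractor": a fixed (frozen) random projection.
# In the actual method this would be a pretrained convolutional network;
# the random matrix here is only an illustrative placeholder.
W = rng.standard_normal((64, 16))  # maps 64-dim flattened "images" to 16-dim features


def features(x):
    """Frozen feature map phi(x); never updated during training."""
    return np.tanh(x @ W)


def perceptual_loss(generated, real):
    """Mean squared distance between feature representations.

    Regularizes the generator in feature space, which tends to be more
    stable than matching raw pixels.
    """
    diff = features(generated) - features(real)
    return float(np.mean(diff ** 2))


real = rng.standard_normal((8, 64))                # batch of "real" images (flattened)
fake = real + 0.1 * rng.standard_normal((8, 64))   # generator output close to real
far = rng.standard_normal((8, 64))                 # unrelated generator output

# An output near the real image yields a smaller perceptual loss than
# an unrelated one, so minimizing this term pulls the generator toward
# perceptually similar images.
print(perceptual_loss(fake, real), perceptual_loss(far, real))
```

In a full GAN, this term would be added to the adversarial loss of the generator with a weighting coefficient, so the generator is pushed both to fool the discriminator and to stay close to real images in feature space.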
Keywords/Search Tags: Deep Learning, Computer Vision, Natural Language Processing, Text-to-Image Synthesis, Generative Adversarial Network