
Research On Text-to-Image Synthesis Based On Generative Adversarial Network

Posted on: 2023-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Huang
Full Text: PDF
GTID: 2558307097479174
Subject: Computer Science and Technology
Abstract/Summary:
As one of the most important tasks in the field of multi-modal learning, text-to-image generation has attracted increasing attention in recent years. The task takes a given natural language description as a condition and generates images that are both realistic and semantically consistent with that description. It has a wide range of applications, such as computer-aided design, image editing, and illustration generation. However, because of the natural semantic gap between the textual and visual modalities, generating images from text descriptions is challenging. Most current methods take text descriptions as conditions, decompose the generation process into multiple sub-stages, and apply different mechanisms to adjust the image content, gradually producing realistic and semantically consistent images. Although these methods have achieved considerable results, some problems remain. To address them, this thesis conducts research from the perspectives of optimizing the attention mechanism and improving the loss function and network structure. The main contributions are summarized as follows:

(1) When adjusting image features, most models focus on correcting the semantic content of the image according to the text information while ignoring image texture information. Moreover, when constraining text-image semantic consistency, these methods model only the relationship between the text and a single image region, ignoring the fact that different types of words should correspond to image regions of different sizes. This reduces the diversity and discriminability of the network's representations. This thesis therefore proposes a Multi-scale Dual-modal Generative Network (MD-GAN). The model is built on cascaded multi-stage generative adversarial networks and mainly contains a dual-modal modulation attention and a multi-scale consistency discriminator. The dual-modal modulation attention combines a textual guiding module and a channel sampling module: the textual guiding module uses text information to guide the adjustment of image semantic content, adaptively connecting text-context features and image features through a gating mechanism, while the channel sampling module selectively aggregates information along the image channel dimension to adjust texture information. The multi-scale consistency discriminator computes the similarity between image regions at different scales and word-level text features, and uses it as a constraint to strengthen the semantic consistency between text and images. Comprehensive experiments on the bird image dataset CUB and the complex-scene image dataset MS-COCO show that the proposed MD-GAN outperforms state-of-the-art methods, and that the generated images are realistic and conform to the text descriptions.
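To make the mechanism concrete, below is a minimal PyTorch sketch of the two components of the dual-modal modulation attention; the class names (TextualGuiding, ChannelSampling), tensor shapes, and internal design (dot-product word attention, squeeze-style channel gating) are illustrative assumptions, not the thesis's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualGuiding(nn.Module):
    """Builds a per-pixel text context from word features, then fuses it
    with the image feature through a learned gate (hypothetical sketch)."""
    def __init__(self, img_dim, word_dim):
        super().__init__()
        self.proj = nn.Linear(word_dim, img_dim)        # map words into image space
        self.gate = nn.Conv2d(img_dim * 2, img_dim, 1)  # gate from both modalities

    def forward(self, img, words):
        # img: (B, C, H, W); words: (B, L, word_dim)
        B, C, H, W = img.shape
        w = self.proj(words)                             # (B, L, C)
        q = img.flatten(2).transpose(1, 2)               # (B, HW, C)
        attn = F.softmax(q @ w.transpose(1, 2), dim=-1)  # pixel-to-word attention
        ctx = (attn @ w).transpose(1, 2).reshape(B, C, H, W)  # text context per pixel
        g = torch.sigmoid(self.gate(torch.cat([img, ctx], dim=1)))
        return g * ctx + (1 - g) * img                   # gated adaptive fusion

class ChannelSampling(nn.Module):
    """Selectively reweights channels to adjust texture information
    (assumed squeeze-and-excitation-style gating)."""
    def __init__(self, img_dim, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(img_dim, img_dim // reduction), nn.ReLU(),
            nn.Linear(img_dim // reduction, img_dim), nn.Sigmoid())

    def forward(self, img):
        s = img.mean(dim=(2, 3))                         # aggregate over spatial dims
        return img * self.fc(s)[:, :, None, None]        # channel-wise gating

In this reading, the gate g lets each pixel decide how much text context to absorb, matching the "adaptive connection through a gating mechanism" described above, while the channel path reweights feature maps rather than spatial positions.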
(2) Models based on cascaded multi-stage generative adversarial networks have high complexity and large training costs, while single-stream generative models reduce the parameter count but train unstably. This thesis therefore proposes a text-to-image generation network structure named Dual-granularity Semantic Fusion Generative Adversarial Network (DSF-GAN). The structure consists of three generation stages: the initial stage takes the conditioning text as input and generates low-resolution coarse images, while the second and third stages are the proposed dual-granularity semantic fusion blocks, which generate high-resolution, high-quality images. Specifically, a dual-granularity semantic fusion block first uses word-level text features to guide fine-grained image pixel generation; it then adopts a residual structure to maintain network performance and further fuse image information, and upsamples the images to a higher resolution; finally, it adjusts the coarse-grained high-level semantic features of the image through sentence-level text features. The block thus guides image generation from the fine-grained spatial dimension and the coarse-grained channel dimension, respectively. In addition, DSF-GAN consists of one generator-discriminator pair and two further generators, so it can effectively reduce the number of model parameters while maintaining training stability. Comprehensive experiments on the CUB and MS-COCO datasets show that DSF-GAN outperforms state-of-the-art methods on almost every metric; moreover, the visual quality of the generated images is more convincing and their semantic content is more consistent with the textual descriptions.
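As an illustration of how one such block might be organized, the following PyTorch sketch follows the four steps described above; the class name DGSFBlock, the affine sentence-level modulation, and all dimensions are hypothetical assumptions rather than the published architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DGSFBlock(nn.Module):
    """One assumed dual-granularity semantic fusion block: word-level spatial
    guidance, residual refinement, upsampling, sentence-level channel modulation."""
    def __init__(self, ch, word_dim, sent_dim):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, ch)
        self.res = nn.Sequential(                        # residual refinement path
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.gamma = nn.Linear(sent_dim, ch)             # channel scale from sentence
        self.beta = nn.Linear(sent_dim, ch)              # channel shift from sentence

    def forward(self, x, words, sent):
        # x: (B, C, H, W); words: (B, L, word_dim); sent: (B, sent_dim)
        B, C, H, W = x.shape
        # 1) word-level features guide fine-grained spatial (pixel) content
        w = self.word_proj(words)                        # (B, L, C)
        attn = F.softmax(x.flatten(2).transpose(1, 2) @ w.transpose(1, 2), dim=-1)
        x = x + (attn @ w).transpose(1, 2).reshape(B, C, H, W)
        # 2) residual structure maintains performance while fusing information
        x = x + self.res(x)
        # 3) upsample to the next resolution
        x = F.interpolate(x, scale_factor=2, mode='nearest')
        # 4) sentence-level features modulate coarse-grained channel semantics
        return self.gamma(sent)[:, :, None, None] * x + self.beta(sent)[:, :, None, None]

Under these assumptions, the word branch acts on spatial positions while the sentence branch acts on channels, which is one plausible realization of guiding generation "from the fine-grained spatial dimension and the coarse-grained channel dimension".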
Keywords/Search Tags:Cross-modal learning, Text-to-image synthesis, Generative adversarial networks, Attention mechanism