
Research On Generative Adversarial Network For Text-to-Image Synthesis

Posted on: 2023-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Huang
Full Text: PDF
GTID: 2568306818996979
Subject: Control Science and Engineering
Abstract/Summary:
Today, with the continuous development of technology, the demand for images with specific semantics keeps growing, and the task of generating images from text has attracted much attention. Text-to-image generation aims to convert natural language descriptions into images; as a cross-modal task, it has become one of the research hotspots of recent years. The generative adversarial network (GAN), currently the most widely used generative model, is also the basic framework for text-to-image generation. Many GAN-based text-to-image works already exist, but challenges remain: generated images suffer from low sharpness and poor diversity, and they often do not match the given text closely enough. How to optimize the generative adversarial network efficiently so that it generates high-quality images consistent with the given text semantics has therefore become an urgent problem.

To address the low visual quality and insufficient semantic consistency of images generated by existing GANs, this thesis conducts in-depth research on network structure optimization and enhanced cross-modal information fusion, strengthening the connection between the generated image and the given text. The main research contents and results are as follows:

(1) A generative adversarial network with adaptive semantic normalization. Batch Normalization (BN) is used in current text-to-image models to accelerate and stabilize training. However, BN ignores feature differences between individual samples and the semantic relationship between modalities, which is harmful for text-to-image tasks. To solve these problems, a module tailored to the text-to-image task, Adaptive Semantic Instance Normalization (ASIN), is proposed. ASIN preserves the individuality of each generated image and injects text semantics into the normalization process: the affine parameters used for denormalization are computed from the text features, establishing a close semantic correlation between the generated image and the given text (see the first sketch below). The optimized network achieves an Inception Score of 4.59 and 30.53, FID of 15.04 and 30.78, and R-precision of 85.60% and 93.85% on the CUB-200-2011 and MS-COCO datasets, respectively.

(2) A generative adversarial network with multi-modal interactively selective attention. Generating high-quality, semantically consistent images relies heavily on the given text. To make fuller use of the effective information in the text, a GAN with multi-modal interactively selective attention is proposed. The network contains a multi-level attention-based cross-modal fusion module, which uses the attention mechanism to capture key words in the text features and key regions in the image features according to the similarity of the two modalities, improving the fusion quality of the two modes (see the second sketch below). A method for memorizing text features is also proposed, through which later generation stages can retrieve the textual focus of earlier stages, assisting the subsequent generation process. The method achieves an Inception Score of 4.79 and 30.98, FID of 14.48 and 28.66, and R-precision of 80.51% and 91.35% on CUB-200-2011 and MS-COCO, respectively.
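To make the ASIN idea concrete, below is a minimal PyTorch sketch of text-conditioned instance normalization in the spirit of the module described in (1). The class name, dimensions, and the two linear heads are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch of text-conditioned instance normalization (ASIN-style).
# Names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSemanticInstanceNorm(nn.Module):
    """Instance-normalizes image features, then denormalizes them with
    affine parameters (gamma, beta) predicted from the sentence embedding."""

    def __init__(self, num_channels: int, text_dim: int):
        super().__init__()
        # Two linear heads map the text embedding to per-channel scale/shift.
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image features; text_emb: (B, text_dim)
        # Per-sample, per-channel statistics preserve individual differences
        # that batch statistics would average away.
        normalized = F.instance_norm(x)
        gamma = self.to_gamma(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(text_emb).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * normalized + beta
```

Similarly, the following sketch illustrates the similarity-driven cross-modal attention step in (2): image regions attend over words, and each region gathers a word-context vector weighted by relevance. Shapes and names are again assumptions, not the thesis module.

```python
# A minimal sketch of cross-modal attention between word features and image
# region features. Illustrative only.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, img_dim: int, text_dim: int, common_dim: int):
        super().__init__()
        # Project both modalities into a shared space for comparison.
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.txt_proj = nn.Linear(text_dim, common_dim)

    def forward(self, img_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N, img_dim) flattened region features
        # word_feats: (B, T, text_dim) per-word embeddings
        q = self.img_proj(img_feats)             # (B, N, D)
        k = self.txt_proj(word_feats)            # (B, T, D)
        sim = torch.bmm(q, k.transpose(1, 2))    # (B, N, T) region-word similarity
        attn = sim.softmax(dim=-1)               # which words matter per region
        context = torch.bmm(attn, k)             # (B, N, D) word context per region
        return context
```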
(3) A generative adversarial network based on multimodal triplet loss. Because text descriptions are subjective, cross-modal-fusion methods can only reflect the objective content of a text; when the linguistic description changes, the generated image also changes unpredictably. To avoid this problem, and to make the generative network perceive consistent high-level semantics across texts and deepen its understanding of a given text, this thesis proposes a GAN based on a multimodal triplet loss. The method modifies the traditional input of text-to-image models by replacing one-to-one image-text pairs with one-to-many pairs, i.e. the same image paired with multiple different text descriptions. The images generated from texts describing the same image are treated as one class. A multimodal triplet loss added to the generator pulls samples within a class closer and pushes different classes further apart, helping the generative network distill the high-level semantics shared by different text descriptions and keep the generated images consistent even when the linguistic expression changes. The method achieves an Inception Score of 4.74 and 34.10, FID of 13.30 and 26.22, and R-precision of 83.32% and 92.77% on CUB-200-2011 and MS-COCO, respectively.
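As a rough illustration of the triplet objective in (3): the anchor and positive are features of images generated from two different captions of the same source image, while the negative comes from a caption of a different image. The function below is a hypothetical sketch of this margin-based form (PyTorch's built-in nn.TripletMarginLoss implements the same objective).

```python
# A minimal sketch of a triplet objective over generated-image features.
# Anchor/positive: generations from two captions of the same image;
# negative: a generation from a caption of another image. Illustrative only.
import torch
import torch.nn.functional as F

def multimodal_triplet_loss(anchor: torch.Tensor,
                            positive: torch.Tensor,
                            negative: torch.Tensor,
                            margin: float = 0.2) -> torch.Tensor:
    d_pos = F.pairwise_distance(anchor, positive)  # intra-class distance
    d_neg = F.pairwise_distance(anchor, negative)  # inter-class distance
    # Pull same-class generations together, push different classes apart.
    return F.relu(d_pos - d_neg + margin).mean()
```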
Keywords/Search Tags:generative adversarial network, text-to-image generation, instance normalization, attention mechanism, multimodal triplet loss