
Research On Text-to-image Generation Based On Generative Adversarial Networks

Posted on: 2021-11-05 | Degree: Master | Type: Thesis
Country: China | Candidate: M Q Hu | Full Text: PDF
GTID: 2518306476453274 | Subject: Computer technology
Abstract/Summary:
Text-to-image generation aims to generate images from natural-language text descriptions, realizing a transformation from the text modality to the image modality while preserving semantics. It is of great significance for image generation applications such as automatic photo-matching for news and user demand portraits. As a cross-cutting problem, it involves two active research areas: natural language understanding and high-resolution image generation. Progress on this problem also benefits the development of text embedding techniques and deep generative models.

The text description is usually an explanatory sentence describing the properties of an object, such as 'The bird is mostly yellow and black with a narrow, pointed beak'. Traditional patchwork-based methods struggle to generate fine-grained, natural, and realistic images. With the development of deep learning and generative adversarial networks, images can now be produced in an end-to-end manner. However, current image generation is mostly noise-based and carried out in an unconditional setting, and only a few works have made preliminary explorations of image generation with textual conditions.

By the granularity of the text, current research can be divided into word-level and sentence-level image generation. Word-level image generation produces images from a single entity word. Entity words are usually represented by class labels, which are isolated and have poor representational power, and the information they contain is insufficient for generating a detailed image. Sentence-level image generation produces images from a single sentence. However, current methods tend to extract an embedded representation from the entire sentence without taking into account the importance of certain entity words for the target image. This thesis conducts research at both the word level and the sentence level and attempts to combine them. The main work of this thesis is as follows:

1) A variational conditional generative adversarial network for word-level image generation is proposed. Traditional generators in conditional GANs simply concatenate the conditional vector with the noise vector as the input representation, which is fed directly into upsampling operations to output an image. However, the hidden conditional information is not fully exploited, especially when the input is a class label, as in word-level image generation. Therefore, variational inference is introduced into the conditional GAN framework to infer a semantically rich latent variable solely from the conditional input, yielding an augmented variable representation for image generation that captures the rich semantic details behind the condition. Qualitative and quantitative experimental results show that the proposed method outperforms state-of-the-art approaches on the word-level task and also produces realistic, controllable images when extended to the sentence-level task.
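To make the generator-side idea concrete, below is a minimal PyTorch sketch of a variational conditional generator; it is an illustration of the general technique, not the thesis's actual architecture, and module names, layer sizes, and the class-label condition are all illustrative assumptions. Instead of concatenating a raw label embedding with noise, a Gaussian latent variable is inferred from the condition alone and sampled with the reparameterization trick, and a KL term regularizes it toward the prior.

# Minimal sketch (assumed architecture, not from the thesis): infer a
# latent variable from the conditional input alone, then feed the
# sampled latent together with noise into an upsampling generator.
import torch
import torch.nn as nn

class VariationalConditionEncoder(nn.Module):
    """Infers a Gaussian latent variable from the conditional input."""
    def __init__(self, num_classes: int, embed_dim: int = 128, latent_dim: int = 100):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.to_mu = nn.Linear(embed_dim, latent_dim)
        self.to_logvar = nn.Linear(embed_dim, latent_dim)

    def forward(self, labels: torch.Tensor):
        h = self.embed(labels)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

class Generator(nn.Module):
    def __init__(self, noise_dim: int = 100, latent_dim: int = 100):
        super().__init__()
        self.fc = nn.Linear(noise_dim + latent_dim, 4 * 4 * 256)
        self.upsample = nn.Sequential(  # upsample 4x4 -> 32x32
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, noise: torch.Tensor, z: torch.Tensor):
        # Augmented representation: sampled condition latent + noise
        h = self.fc(torch.cat([noise, z], dim=1)).view(-1, 256, 4, 4)
        return self.upsample(h)

# Usage: the KL term keeps the inferred latent close to the prior.
enc, gen = VariationalConditionEncoder(num_classes=200), Generator()
labels = torch.randint(0, 200, (8,))
z, mu, logvar = enc(labels)
fake = gen(torch.randn(8, 100), z)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

The key design point this sketch illustrates is that the condition contributes a whole learned distribution rather than a single fixed embedding, so sampling from it can express the semantic variety hidden behind one class label.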
2) A sentence-level image generation method combined with entity knowledge learning is proposed. Many existing approaches learn to generate an image in an end-to-end manner from a global sentence embedding and do not highlight entity information. However, the class categories of the entities described in the text are very important for generating semantically aligned images with clear objects. For instance, knowing the visual information of the two key entities, 'bird' and 'sky', in the sentence 'A bird is flying in the sky' makes it easier to generate an image containing these two entities. Therefore, this thesis proposes an entity knowledge learning method to solve the sentence-level task. It generates target images through joint learning of entity information and the global semantics of the sentence. Specifically, the method introduces a word-level generation network to learn entity knowledge from entity labels, and fuses representations with the sentence-level generation network in both the image feature space and a low-dimensional semantic latent space. This highlights entity information within the global sentence representation and produces images with clear entities that are semantically aligned with the text description. We also propose a novel metric named Entity Matching Score (EMS) to measure the degree of consistency between a generated image and its corresponding text description. Experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms strong baselines on two benchmark datasets.
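The abstract does not give the exact definition of EMS, so the following Python sketch only illustrates the general idea of an entity-matching style score: check how many entities mentioned in the text are actually found in the paired generated image. The naive vocabulary-overlap entity extractor and the detector-output interface are assumptions, not the thesis's setup.

# Hypothetical sketch of an entity-matching style score; the real EMS
# formula is not given in this abstract. A real pipeline would use an
# NER/POS tagger for entity extraction and a pretrained object
# detector to produce the per-image detection sets.
from typing import Iterable, Set

def extract_entities(caption: str, vocabulary: Set[str]) -> Set[str]:
    """Naive entity extraction: keep caption words that appear in a
    known entity vocabulary."""
    return {w.strip('.,').lower() for w in caption.split()} & vocabulary

def entity_matching_score(captions: Iterable[str],
                          detections: Iterable[Set[str]],
                          vocabulary: Set[str]) -> float:
    """Average fraction of text entities detected in the paired image."""
    scores = []
    for caption, detected in zip(captions, detections):
        entities = extract_entities(caption, vocabulary)
        if entities:
            scores.append(len(entities & detected) / len(entities))
    return sum(scores) / len(scores) if scores else 0.0

# Example: the detector found 'bird' but not 'sky' in the image.
vocab = {"bird", "sky", "tree"}
print(entity_matching_score(["A bird is flying in the sky"],
                            [{"bird"}], vocab))  # -> 0.5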
Keywords/Search Tags: Text-to-image Generation, Text Representation, Image Generation, Generative Adversarial Networks, Variational Inference