
Research And Applications Of Interactive Image Generation Technologies Based On Multimodal Disentanglement

Posted on: 2022-11-16  Degree: Master  Type: Thesis
Country: China  Candidate: T R Niu  Full Text: PDF
GTID: 2558306914481304  Subject: Intelligent Science and Technology
Abstract/Summary:
The rapid evolution of the Internet enables everyone to share and sell the content they create, and the mass production of multimedia content in the form of text, images, and video has in turn further fueled the growth of the Internet. Image content creation is an important task, yet its difficulty restricts it to professionals such as illustrators and image editors. There is therefore a broad need for machine-assisted image creation, in which people create image content by interacting with machines. For users without painting experience, controlling a machine through language is both the easiest and the most natural approach, which relies on text-to-image generation technology. Existing text-to-image generation models, however, fall short in controllability and interactivity, which keeps them from practical application. In this thesis, we study three problems in the existing technology.

First, for single-caption-to-image generation, we seek to improve control over image-side information as well as generation quality. We propose Modality Disentangled Generative Adversarial Networks (MD-GAN), which partition image features into modality-common and modality-specific features via disentanglement operations. Experimental results suggest that MD-GAN improves image generation quality and enables manipulation of image style through style transfer and interpolation, improving overall controllability.

Second, for rich-text-to-image generation, we address the problem of missing attribute information for local objects. We propose Visual Question Answering Generative Adversarial Networks (VQA-GAN), which perform object-attribute disentanglement and improve the details of generated images by aligning the locally relevant texts, which describe object attributes and local details, with patches of the generated images. We also propose VQA accuracy as a new metric for better evaluation of images containing multiple objects. Extensive experiments show that VQA-GAN greatly improves the accuracy of generated object attributes, and that VQA accuracy is better suited to evaluating the attributes of multiple objects.

Third, for dialogue-to-image generation, we identify improving the incrementality of generation as an important task. We propose the RR-GAN model with a Random Replay algorithm, which randomly truncates the dialogue to create synthetic training samples, mitigating the train-test discrepancy caused by the lack of intermediate image supervision. To measure the incrementality of models, we build the diagnostic CLEVRD dataset and propose a complete evaluation framework. The evaluation results show that RR-GAN significantly improves incrementality during generation without compromising image quality; the model thereby acquires object-level disentanglement and is better suited to practical text-to-image generation applications.

Finally, building on the research above, we develop an interactive image generation system that implements both single-caption and dialogue-to-image generation with an interactive human-machine interface. The system allows users to generate and modify images progressively, demonstrating the performance of the models and the potential of machine-assisted image creation applications.
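As an illustration of the VQA-accuracy metric mentioned above, the following Python sketch scores generated images by letting a pre-trained VQA model answer attribute questions and counting matches with the ground truth. The `vqa_model` callable and the question format are assumptions made for illustration, not the thesis implementation.

```python
def vqa_accuracy(images, qa_pairs_per_image, vqa_model):
    """Minimal sketch of a VQA-accuracy metric.

    For each generated image, an assumed pre-trained VQA model answers
    attribute questions (e.g. "what color is the bird's wing?"); the
    metric is the fraction of answers matching the expected ones.
    """
    correct = total = 0
    for image, qa_pairs in zip(images, qa_pairs_per_image):
        for question, expected in qa_pairs:
            answer = vqa_model(image, question)  # assumed interface: returns a string answer
            correct += int(answer.strip().lower() == expected.strip().lower())
            total += 1
    return correct / max(total, 1)
```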
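The Random Replay idea can likewise be sketched in a few lines: a dialogue is cut at a random turn, and each truncated prefix is paired with the final target image to stand in for the missing intermediate supervision. Function and variable names below are hypothetical.

```python
import random

def random_replay_samples(dialogue_turns, final_image, num_samples=3, seed=None):
    """Minimal sketch of Random Replay: build training pairs from
    randomly truncated dialogue prefixes, each supervised by the final
    ground-truth image (no intermediate images are assumed)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        cut = rng.randint(1, len(dialogue_turns))  # keep at least one turn
        samples.append((dialogue_turns[:cut], final_image))
    return samples

# Example: a three-turn dialogue yields shorter synthetic training prefixes.
turns = ["draw a red cube", "add a blue sphere to its left", "make the cube larger"]
for prefix, target in random_replay_samples(turns, final_image="img_001.png", seed=0):
    print(len(prefix), "turn(s) ->", target)
```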
Keywords/Search Tags: text-to-image generation, generative adversarial networks, multimodal, disentanglement