
Research On Multi-modal Generative Models Based On Deep Neural Networks

Posted on: 2024-05-02 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: X R Zhou | Full Text: PDF
GTID: 1528307160458894 | Subject: Information and Communication Engineering
Abstract/Summary:
Multi-modal generative models are an important research topic in computer vision, involving the generation of output data from inputs of various modalities. They simulate the process by which humans integrate and associate multiple sensory inputs, and are seen as a way of modelling human creative processes. Research on multi-modal generative models can help us understand how the human brain processes information and can serve as a tool to explore human creativity and imagination. On this topic, many algorithms and solutions have been proposed to improve the quality of generative models, enhance the representation and fusion of multi-modal information, optimize model training and inference, and advance the development and application of multi-modal generative models.

This thesis focuses on generating images and videos from multi-modal data, which requires properly fusing information from different modalities, establishing connections between them, controlling the generation process, and ultimately producing high-quality, photo-realistic results. Because multi-modal generative models involve multiple data domains, their design must consider how to represent and fuse data from these domains and how to establish connections between them. As generative models, they must also consider how to control the generation process so that the results are influenced in the expected way and the generated images are high-quality and photo-realistic. Therefore, research on multi-modal generative models must attend not only to the representation and fusion of data, but also to the control of the generation process and the optimization of the results.

This thesis identifies three key factors in multi-modal generative models: multi-modal feature representation, cross-modal fusion, and the establishment of cross-domain correspondences. Based on these three key factors, three research questions are posed: how to capture appearance attribute information from text descriptions and edit images accordingly? how to establish dense cross-domain correspondences at full resolution? and how to model dynamic faces and generate talking head videos driven by audio? The thesis thus proposes a research and analysis framework for multi-modal generative models, organized around the theme of "three key factors, three research questions". The research is discussed from three perspectives: handling data of different modalities, promoting cross-modal fusion and establishing connections, and conditionally controlling generation. Three methods are developed accordingly: a text-based person image generation method, a full-resolution correspondence learning method for image translation, and a talking head generation method based on neural radiance fields. The main contributions of this thesis are as follows:

1. To capture attribute information from text descriptions and edit images, this thesis develops a text-based person image generation method. The work focuses on editing person images through text while accounting for appearance attributes and the complex geometric relationships of human poses. A new problem setting and a corresponding solution are proposed. The method extracts information from natural language descriptions and establishes a mapping between images and text, thereby enabling automatic, text-controlled editing of person images. Additionally, a new perceptual score is developed as an evaluation metric for this task (a minimal sketch of text-conditioned feature modulation is given after this list).

2. To efficiently establish cross-domain correspondences at full resolution, this thesis develops a full-resolution correspondence learning method for image translation. The method employs a hierarchical strategy and improves the PatchMatch algorithm, iteratively searching for local optima to approximate the global optimum. A ConvGRU module is introduced to refine the matching results, taking a larger context range into account and incorporating historical estimates into the matching process. The method can be trained end to end and scales to higher resolutions. It effectively addresses the challenge of learning full-resolution correspondences, leading to better image translation results (a simplified PatchMatch sketch follows this list).

3. To model dynamic faces and generate talking head videos driven by audio, this thesis develops a talking head generation method based on neural radiance fields. The method uses an implicit neural scene representation to model a dynamic talking face, which can be driven by audio and rendered into high-quality talking head videos via volume rendering. To improve the quality of the results, the thesis jointly learns the neural scene representation and the camera calibration, and adopts a coarse-to-fine position encoding strategy (sketched after this list). The method supports audio inputs from different identities and enables editing of the facial viewpoint, exhibiting good generalization and robustness. It extends the application of neural radiance fields to face generation and reenactment.
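To make the text-conditioned editing of contribution 1 concrete, the following is a minimal sketch of one common way to inject a sentence embedding into image features: FiLM-style per-channel modulation. The conditioning scheme, module names, and all sizes are illustrative assumptions, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """FiLM-style modulation of image features by a sentence embedding.

    A minimal sketch under stated assumptions: the text embedding predicts
    a per-channel scale and shift that steer appearance attributes in the
    features a decoder would later turn into an image.
    """

    def __init__(self, img_ch=256, txt_dim=512):
        super().__init__()
        self.to_scale = nn.Linear(txt_dim, img_ch)
        self.to_shift = nn.Linear(txt_dim, img_ch)

    def forward(self, img_feat, txt_emb):
        # img_feat: (B, C, H, W); txt_emb: (B, txt_dim)
        s = self.to_scale(txt_emb)[:, :, None, None]
        b = self.to_shift(txt_emb)[:, :, None, None]
        # the text decides how each feature channel is scaled and shifted
        return img_feat * (1 + s) + b

cond = TextConditioner()
feat = torch.randn(2, 256, 32, 32)   # image features from some encoder
emb = torch.randn(2, 512)            # sentence embedding from some text encoder
out = cond(feat, emb)                # (2, 256, 32, 32)
```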
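Contribution 2 builds on PatchMatch-style iterative search. Below is a deliberately unoptimised sketch of the classic propagation-and-random-search loop applied to deep feature maps: each position keeps its best match so far, tries its neighbours' matches shifted by one pixel, then samples random candidates in a shrinking radius. The thesis' hierarchical, end-to-end trainable variant with ConvGRU refinement is considerably more involved; this illustrates only the base algorithm.

```python
import torch
import torch.nn.functional as F

def match_cost(fa, fb, y, x, by, bx):
    # negative cosine similarity between the feature at (y, x) in fa
    # and the candidate match (by, bx) in fb; lower is better
    return -F.cosine_similarity(fa[:, y, x], fb[:, by, bx], dim=0)

def patchmatch(fa, fb, iters=4):
    """Estimate a correspondence field from fa to fb, both (C, H, W)."""
    C, H, W = fa.shape
    # random initialisation: nnf[y, x] = (by, bx), a position in fb
    nnf = torch.stack([torch.randint(0, H, (H, W)),
                       torch.randint(0, W, (H, W))], dim=-1)
    best = torch.empty(H, W)
    for y in range(H):
        for x in range(W):
            best[y, x] = match_cost(fa, fb, y, x, nnf[y, x, 0], nnf[y, x, 1])
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                # propagation: adopt a neighbour's match, shifted by one pixel
                for dy, dx in ((-1, 0), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        by = int(nnf[ny, nx, 0]) - dy
                        bx = int(nnf[ny, nx, 1]) - dx
                        if 0 <= by < H and 0 <= bx < W:
                            c = match_cost(fa, fb, y, x, by, bx)
                            if c < best[y, x]:
                                best[y, x] = c
                                nnf[y, x, 0], nnf[y, x, 1] = by, bx
                # random search around the current match, shrinking radius
                r = max(H, W) // 2
                while r >= 1:
                    by = int(nnf[y, x, 0]) + int(torch.randint(-r, r + 1, ()))
                    bx = int(nnf[y, x, 1]) + int(torch.randint(-r, r + 1, ()))
                    if 0 <= by < H and 0 <= bx < W:
                        c = match_cost(fa, fb, y, x, by, bx)
                        if c < best[y, x]:
                            best[y, x] = c
                            nnf[y, x, 0], nnf[y, x, 1] = by, bx
                    r //= 2
    return nnf  # (H, W, 2) integer correspondence field

nnf = patchmatch(torch.randn(64, 32, 32), torch.randn(64, 32, 32))
```

Because each candidate is only a local improvement over the current estimate, repeated propagation lets good matches spread across the image, approximating the global optimum without an exhaustive search; this is what makes full-resolution matching tractable.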
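For contribution 3, the coarse-to-fine position encoding can be sketched as a frequency encoding whose higher bands are gradually un-masked as training proceeds, so the radiance field first fits coarse geometry and only later sharp detail. The cosine window below follows the schedule popularised by BARF; whether the thesis uses exactly this window is an assumption.

```python
import torch

def coarse_to_fine_encoding(x, num_freqs=10, alpha=0.0):
    """Positional encoding gamma(x) with a coarse-to-fine window.

    x:         (..., D) coordinates, e.g. 3-D points sampled along rays
    num_freqs: number of frequency bands L
    alpha:     annealed from 0 to L over training; band k only opens
               once alpha exceeds k, so low frequencies are learned
               first (coarse) and high ones later (fine).
    """
    feats = [x]
    for k in range(num_freqs):
        # smooth 0 -> 1 window for band k (BARF-style schedule; the
        # thesis' exact annealing may differ)
        w = min(max(alpha - k, 0.0), 1.0)
        weight = 0.5 * (1.0 - torch.cos(torch.tensor(w * torch.pi)))
        for fn in (torch.sin, torch.cos):
            feats.append(weight * fn((2.0 ** k) * torch.pi * x))
    return torch.cat(feats, dim=-1)  # (..., D * (2 * num_freqs + 1))

pts = torch.rand(1024, 3)
enc_early = coarse_to_fine_encoding(pts, alpha=2.0)   # only coarse bands active
enc_late = coarse_to_fine_encoding(pts, alpha=10.0)   # all bands active
```

In an audio-driven radiance field of this kind, the encoded point, the viewing direction, and a per-frame audio feature are typically concatenated as input to the MLP that predicts density and colour, which volume rendering then integrates along each camera ray to form the video frame.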
Keywords/Search Tags:Correspondence learning, generative adversarial networks, neural radiance field, generative models, multi-modal models