
Research On Voice-driven Face Generation Method Based On Static Attributes And Dynamic Correlation

Posted on: 2022-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: L. L. Zhao
Full Text: PDF
GTID: 2518306560955429
Subject: Information and Communication Engineering
Abstract/Summary:
There is a complex correlation between a person's voice and facial appearance: both static attribute information and dynamic change information about the speaker's face can be inferred from the voice signal. The task of voice-driven face generation is to mine these static and dynamic correlations between voice signals and facial images, and to build a corresponding audio-visual cross-modal generation model that produces a static face image or a dynamic face sequence from a given voice segment.

For static face image generation, most existing methods rely on time-aligned audio-visual datasets to generate faces with consistent identities. In practice, however, the true identity behind an input voice is difficult to obtain, which degrades generation quality and limits the applicability of such models. This thesis therefore constructs a voice-driven static face generation model based on conditional generative adversarial networks and optimizes it on an attribute-aligned voice-face dataset, so that it generates high-quality, diverse static face images with consistent attributes (gender, age). This thesis also builds a voice-driven dynamic face generation model based on conditional generative adversarial networks: by adding a purpose-designed lip discriminator, it alleviates the difficulty existing models have in accurately synchronizing lip movement with the voice segment, enabling the generation of dynamic face sequences with synchronized lip movement.

The main research results of this thesis are summarized as follows:

1. Constructed a voice-driven static face generation model. A voice encoder with a self-attention mechanism extracts an auditory feature representation from the voice signal and feeds it to a static face generator based on conditional generative adversarial networks. Static face images with consistent attributes (gender, age) are then generated while the generator is updated against an image discriminator equipped with a projection module. The model is trained and tested on the attribute-aligned voice-face dataset and achieves strong results.

2. Established an attribute-aligned (gender + age) voice-face dataset (Voice-Face), in which the voice segments and face images come from different source datasets. Combining the data of the two modalities by age group and gender establishes a correspondence between the attribute combinations of the voice signal and the face image.

3. Constructed a voice-driven dynamic face generation model that takes a voice segment and an identity face image as input, fuses the auditory feature vectors extracted by the voice encoder with the image feature vectors extracted by the image encoder, and sends them to a dynamic face generator. The designed lip discriminator and the image discriminator work together to alternately update the dynamic face generator, so that lip movement in the generated dynamic face sequence is accurately synchronized with the input voice segment. Qualitative and quantitative experiments verify the strong performance of the model.
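To illustrate the conditioning mechanism behind an image discriminator with a projection module, the following minimal NumPy sketch scores an image feature against a voice-derived condition vector using the projection form D(x, y) = ψ(φ(x)) + yᵀVφ(x). All dimensions, weights, and function names here are illustrative assumptions for exposition, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(image, W_img):
    # Stand-in for the image feature extractor (e.g., a conv backbone).
    return np.tanh(W_img @ image)

def projection_score(image, cond, W_img, w_psi, V):
    # Projection-style conditional discriminator score:
    #   D(x, y) = psi(phi(x)) + y^T V phi(x)
    # where y is the conditioning vector (here, a voice embedding).
    feat = phi(image, W_img)
    unconditional = w_psi @ feat        # real/fake score from the image alone
    conditional = cond @ (V @ feat)     # inner-product conditioning term
    return unconditional + conditional

# Illustrative dimensions (assumptions).
d_img, d_feat, d_cond = 64, 32, 16
W_img = rng.standard_normal((d_feat, d_img)) * 0.1
w_psi = rng.standard_normal(d_feat) * 0.1
V = rng.standard_normal((d_cond, d_feat)) * 0.1

image = rng.standard_normal(d_img)       # flattened "face image"
voice_emb = rng.standard_normal(d_cond)  # voice-derived condition vector

score = projection_score(image, voice_emb, W_img, w_psi, V)
print(float(score))
```

During adversarial training, this score would be pushed up for real attribute-matched pairs and down for generated or mismatched pairs, which is what lets the conditioning signal (gender, age) steer the generator.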
Keywords/Search Tags:Consistent attributes, Lip synchronization, Voice-driven, Face generation, Generative adversarial networks