
Audio-driven Talking Face Generation

Posted on: 2021-05-11    Degree: Master    Type: Thesis
Country: China    Candidate: H Liu    Full Text: PDF
GTID: 2428330614456795    Subject: Signal and Information Processing
Abstract/Summary:
Facial image and video generation are hot research topics in computer vision with a wide range of applications. Audio-driven talking face generation takes both audio and identity information as input and generates talking face videos in which the mouth moves in synchrony with the corresponding audio. Generating high-quality facial videos with lip-audio synchronization and natural-looking facial expressions is the main challenge of talking face generation. Aiming at generating facial videos with natural expressions, this dissertation proposes an expression-tailored generative adversarial network and a cycle training-based expression-alterable network. In addition, this dissertation considers mutual information maximization between the video expression representation and the audio emotion representation, so that audio can also drive the target identity's expression. The main contributions of this dissertation include the following aspects:

1) This dissertation proposes the Expression-Tailored Generative Adversarial Network (ET-GAN), built on an encoder-decoder framework, to generate facial videos with different natural-looking expressions. The expression encoder and the optical flow discriminator in ET-GAN extract expression features and enforce temporally continuous videos, respectively. The expression encoder disentangles the target emotion information from an input video clip and applies it to another identity, so that the other identity's talking face video reveals the same target expression. The optical flow discriminator captures the magnitude of facial motion, judges whether generated video sequences are temporally continuous, and helps avoid deformed faces (both components are sketched below). Experimental results show the effectiveness of ET-GAN on both the CREMA-D and GRID datasets.

2) This dissertation proposes the Expression-Alterable Generative Adversarial Network (EA-GAN), which can flexibly translate input facial videos to any desired target expression. Different from ET-GAN, a one-hot vector is used to switch between expressions, so emotion-driven video clips are no longer needed. With the help of a contrastive loss, the visual features of mouth movements and the corresponding audio features are forced into the same feature space (see the loss sketch below), which resolves the mismatch between mouth shape and audio in this unsupervised task. Experimental results on the CREMA-D dataset show the flexibility of translating between different expressions.

3) This dissertation proposes the Mutual Information Maximization based Emotion-Tailored Generative Adversarial Network (MIMET-GAN). Since the underlying emotional information in the audio signal is closely related to the emotional information in the visual signal of facial expressions, the mutual information estimator in MIMET-GAN maximizes the mutual information between the audio and visual features to generate natural-looking expressive talking face videos that match the emotional audio signal (an estimator sketch is given below). In this way, both video and audio clips can be used to drive the target identity's expression. Experimental results on the CREMA-D dataset show the effectiveness of MIMET-GAN.
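The abstract does not give implementation details for ET-GAN, so the following is only a minimal PyTorch sketch of the two components named in contribution 1): an encoder-decoder generator conditioned on identity, audio and expression codes, and a discriminator operating on the optical flow between consecutive frames. All module names, layer sizes and the 64x64 output resolution are illustrative assumptions, not the dissertation's actual architecture.

import torch
import torch.nn as nn

class ETGenerator(nn.Module):
    """Encoder-decoder style generator: identity, audio and expression codes are
    fused and decoded into one video frame (64x64 RGB here, chosen arbitrarily)."""
    def __init__(self, id_dim=128, aud_dim=128, expr_dim=64):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(id_dim + aud_dim + expr_dim, 4 * 4 * 256), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),   # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),    # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),     # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())      # 64x64 RGB frame

    def forward(self, id_code, aud_code, expr_code):
        z = torch.cat([id_code, aud_code, expr_code], dim=1)
        return self.decode(z)

class FlowDiscriminator(nn.Module):
    """Real/fake score for the optical flow (dx, dy) between consecutive frames,
    encouraging temporally continuous, non-deformed generated sequences."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, flow):
        return self.net(flow)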
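For contribution 2), a standard pairwise contrastive loss can pull matched mouth-motion and audio features together and push mismatched pairs apart. The sketch below assumes the features are already extracted as fixed-length vectors and uses a hypothetical margin value and emotion-class ordering; the dissertation's exact formulation is not given in the abstract.

import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(vis_feat, aud_feat, label, margin=1.0):
    """Pairwise contrastive loss: matched (label=1) mouth-motion/audio feature
    pairs are pulled together, mismatched pairs (label=0) are pushed at least
    `margin` apart, so both modalities end up in a shared feature space."""
    d = F.pairwise_distance(vis_feat, aud_feat)        # per-pair Euclidean distance
    pos = label * d.pow(2)                             # matched pairs: shrink distance
    neg = (1 - label) * F.relu(margin - d).pow(2)      # mismatched pairs: keep apart
    return 0.5 * (pos + neg).mean()

# The target expression can then be selected with a one-hot code, e.g. index 3
# out of six CREMA-D style emotion classes (the ordering here is hypothetical):
target_expr = F.one_hot(torch.tensor([3]), num_classes=6).float()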
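For contribution 3), one common way to maximize mutual information between two feature sets is a MINE-style statistics network trained with the Donsker-Varadhan lower bound. The sketch below is one such estimator under assumed feature dimensions; it is not necessarily the estimator used in MIMET-GAN.

import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    """Statistics network T(audio, visual) for a MINE-style mutual information
    lower bound; feature dimensions are assumptions."""
    def __init__(self, aud_dim=128, vis_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(aud_dim + vis_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, aud, vis):
        return self.net(torch.cat([aud, vis], dim=1))

def mi_lower_bound(estimator, aud, vis):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    Maximizing this w.r.t. both the estimator and the feature encoders
    increases the mutual information between audio and visual features."""
    joint = estimator(aud, vis)                                    # aligned pairs
    shuffled = vis[torch.randperm(vis.size(0), device=vis.device)]
    marginal = estimator(aud, shuffled)                            # broken pairs
    return joint.mean() - torch.log(torch.exp(marginal).mean() + 1e-8)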
Keywords/Search Tags: Audio-driven talking face generation, multi-modal generation, unsupervised generative adversarial network, cycle training-based method, mutual information maximization