
Audio-driven Talking Face Generation

Posted on: 2021-05-11    Degree: Master    Type: Thesis
Country: China    Candidate: H Liu    Full Text: PDF
GTID: 2428330614456795    Subject: Signal and Information Processing
Abstract/Summary:
Facial image and video generation are hot research topics in computer vision with a wide range of applications. Audio-driven talking face generation takes both audio and identity information as input and generates talking face videos in which the mouth moves in synchrony with the corresponding audio. Generating high-quality facial videos with lip-audio synchronization and natural-looking facial expressions is the main challenge of talking face generation. Aiming at generating facial videos with natural expressions, this dissertation proposes an expression-tailored generative adversarial network and a cycle training-based expression-alterable network. In addition, this dissertation considers mutual information maximization between the video expression representation and the audio emotion representation, so that audio can also drive the target identity's expression. The main contributions of this dissertation include the following aspects:

1) This dissertation proposes the Expression-Tailored Generative Adversarial Network (ET-GAN), built on an encoder-decoder framework, to generate facial videos with different natural-looking expressions. The expression encoder and the optical flow discriminator in ET-GAN extract expression features and enforce temporally continuous videos, respectively. The expression encoder disentangles the target emotion information from an input video clip and applies it to another identity, so that the other identity's talking face video reveals the same target expression. The optical flow discriminator captures the magnitude of facial motion, judges whether generated video sequences are temporally continuous, and helps avoid deformed faces (both components are sketched below). Experimental results show the effectiveness of ET-GAN on both the CREMA-D and GRID datasets.

2) This dissertation proposes the Expression-Alterable Generative Adversarial Network (EA-GAN), which can flexibly translate input facial videos to any desired target expression. Different from ET-GAN, a one-hot vector is used to switch between expressions, so emotion-driven video clips are no longer needed. With the help of a contrastive loss, the visual features of mouth movements and the corresponding audio features are forced into the same feature space (see the loss sketch below), which resolves the mismatch between mouth shape and audio in this unsupervised task. Experimental results on the CREMA-D dataset show the flexibility of translating between different expressions.

3) This dissertation proposes the Mutual Information Maximization based Emotion-Tailored Generative Adversarial Network (MIMET-GAN). Since the underlying emotional information in the audio signal is closely related to the emotional information in the visual signal of facial expressions, the mutual information estimator in MIMET-GAN maximizes the mutual information between the audio and visual features to generate natural-looking expressive talking face videos that match the emotional audio signal (an estimator sketch is given below). In this way, both video and audio clips can be used to drive the target identity's expression. Experimental results on the CREMA-D dataset show the effectiveness of MIMET-GAN.
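The abstract does not give implementation details for ET-GAN, so the following is only a minimal PyTorch sketch of the two components named in contribution 1): an encoder-decoder generator conditioned on identity, audio and expression codes, and a discriminator operating on the optical flow between consecutive frames. All module names, layer sizes and the 64x64 output resolution are illustrative assumptions, not the dissertation's actual architecture.

import torch
import torch.nn as nn

class ETGenerator(nn.Module):
    """Encoder-decoder style generator: identity, audio and expression codes are
    fused and decoded into one video frame (64x64 RGB here, chosen arbitrarily)."""
    def __init__(self, id_dim=128, aud_dim=128, expr_dim=64):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(id_dim + aud_dim + expr_dim, 4 * 4 * 256), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),   # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),    # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),     # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())      # 64x64 RGB frame

    def forward(self, id_code, aud_code, expr_code):
        z = torch.cat([id_code, aud_code, expr_code], dim=1)
        return self.decode(z)

class FlowDiscriminator(nn.Module):
    """Real/fake score for the optical flow (dx, dy) between consecutive frames,
    encouraging temporally continuous, non-deformed generated sequences."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, flow):
        return self.net(flow)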
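For contribution 2), a standard pairwise contrastive loss can pull matched mouth-motion and audio features together and push mismatched pairs apart. The sketch below assumes the features are already extracted as fixed-length vectors and uses a hypothetical margin value and emotion-class ordering; the dissertation's exact formulation is not given in the abstract.

import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(vis_feat, aud_feat, label, margin=1.0):
    """Pairwise contrastive loss: matched (label=1) mouth-motion/audio feature
    pairs are pulled together, mismatched pairs (label=0) are pushed at least
    `margin` apart, so both modalities end up in a shared feature space."""
    d = F.pairwise_distance(vis_feat, aud_feat)        # per-pair Euclidean distance
    pos = label * d.pow(2)                             # matched pairs: shrink distance
    neg = (1 - label) * F.relu(margin - d).pow(2)      # mismatched pairs: keep apart
    return 0.5 * (pos + neg).mean()

# The target expression can then be selected with a one-hot code, e.g. index 3
# out of six CREMA-D style emotion classes (the ordering here is hypothetical):
target_expr = F.one_hot(torch.tensor([3]), num_classes=6).float()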
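For contribution 3), one common way to maximize mutual information between two feature sets is a MINE-style statistics network trained with the Donsker-Varadhan lower bound. The sketch below is one such estimator under assumed feature dimensions; it is not necessarily the estimator used in MIMET-GAN.

import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    """Statistics network T(audio, visual) for a MINE-style mutual information
    lower bound; feature dimensions are assumptions."""
    def __init__(self, aud_dim=128, vis_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(aud_dim + vis_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, aud, vis):
        return self.net(torch.cat([aud, vis], dim=1))

def mi_lower_bound(estimator, aud, vis):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    Maximizing this w.r.t. both the estimator and the feature encoders
    increases the mutual information between audio and visual features."""
    joint = estimator(aud, vis)                                    # aligned pairs
    shuffled = vis[torch.randperm(vis.size(0), device=vis.device)]
    marginal = estimator(aud, shuffled)                            # broken pairs
    return joint.mean() - torch.log(torch.exp(marginal).mean() + 1e-8)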
Keywords/Search Tags: Audio-driven talking face generation, multi-modal generation, unsupervised generative adversarial network, cycle training-based method, mutual information maximization