Talking Face Generation Driven By Audio And Action Units

Posted on: 2022-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: S Chen
Full Text: PDF
GTID: 2558307154474884
Subject: Engineering
Abstract/Summary:
Talking face generation synthesizes a lip-synchronized talking face video from an arbitrary face image and corresponding audio clips. This task can greatly promote interaction between humans and avatars. To address the limitation that existing methods often use only audio as driving information and ignore the local information of the mouth muscles, this paper proposes a novel generative framework that uses not only audio as driving information but also speech-related facial action units (AUs) as local driving information for the mouth muscles, so as to control mouth motion accurately. In addition, talking face generation is a cross-modal task. To overcome the limitation of existing methods that directly concatenate cross-modal features and ignore relationship learning between them, this paper also proposes a dilated non-causal temporal convolutional self-attention network (TCSAN) as the multimodal fusion module to promote relationship learning across cross-modal features.

First, this paper studies talking head generation driven by speech and facial action units. Speech-related AU information can guide mouth movement more accurately. Since speech is highly correlated with speech-related AUs, an audio-to-AU module is proposed to predict speech-related AU information from speech. In addition, a pre-trained AU classifier is used to supervise the generator so that the generated images contain the correct AU information. The effectiveness of the model is fully verified on the GRID and TCD-TIMIT datasets: extensive quantitative and qualitative experiments demonstrate that the proposed method outperforms existing methods in both image quality and lip-sync accuracy.

To promote relationship learning over cross-modal information, this paper also studies a multimodal representation fusion method based on TCSAN. It achieves a larger receptive field through dilated non-causal convolution to maintain the temporal relationship of the generated frames, and integrates a multi-head self-attention mechanism to promote relationship learning across cross-modal features. Comparisons with other mainstream fusion methods fully demonstrate the advantages of the proposed TCSAN. This paper also compares the full model, equipped with both the audio-to-AU module and the TCSAN module, against existing methods. Extensive quantitative and qualitative evaluation results fully demonstrate the superiority of the proposed talking head generation system.
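The abstract describes TCSAN only at a high level: a stack of dilated non-causal temporal convolutions combined with multi-head self-attention over fused audio/AU features. Below is a minimal PyTorch-style sketch of one plausible reading of such a block; all names, layer choices, and hyperparameters (TCSABlock, kernel size 3, four blocks with dilations 1, 2, 4, 8) are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class TCSABlock(nn.Module):
    """One hypothetical TCSAN block: dilated non-causal conv + self-attention."""
    def __init__(self, channels: int, dilation: int, num_heads: int = 4):
        super().__init__()
        # Non-causal dilated convolution: symmetric padding lets each frame
        # see both past and future context, enlarging the receptive field
        # while preserving the sequence length.
        self.conv = nn.Conv1d(
            channels, channels, kernel_size=3,
            padding=dilation, dilation=dilation,
        )
        # Multi-head self-attention over time to relate cross-modal features.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels), e.g. concatenated audio and AU features
        # projected to a shared dimension.
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + torch.relu(y))
        a, _ = self.attn(x, x, x)  # attend across all frames (non-causal)
        return self.norm2(x + a)

class TCSAN(nn.Module):
    """Stack of blocks with exponentially growing dilation (1, 2, 4, 8)."""
    def __init__(self, channels: int, num_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            TCSABlock(channels, dilation=2 ** i) for i in range(num_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x

# Usage sketch: 75 frames of 256-dim fused audio/AU features per clip.
fusion = TCSAN(channels=256)
feats = torch.randn(2, 75, 256)
out = fusion(feats)  # (2, 75, 256), temporally contextualized
```

Exponentially growing dilations are a common choice for temporal convolutional networks because the receptive field grows geometrically with depth; whether the thesis uses this exact schedule is not stated in the abstract.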
Keywords/Search Tags: talking head generation, video synthesis, speech animation, facial action units