
Text/Speech-Driven Talking Face Generation With High Naturalness

Posted on: 2021-05-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L Y Yu    Full Text: PDF
GTID: 1368330602494194    Subject: Control Science and Engineering
Abstract/Summary:
Given an arbitrary speech clip or text as input, talking face generation aims to produce photo-realistic, lip-synced face animations. The task has a wide range of important practical applications, such as film production and digital computer games. In addition, because talking face generation provides visual cues about the place of articulation, it can also be applied to language tutoring or to the adjuvant treatment of patients with hearing impairment. However, the task is challenging: it maps one-dimensional speech signals or text to three-dimensional video, and generating photo-realistic videos requires accounting for many factors, including the naturalness of facial expressions, the temporal dependency between adjacent frames, and the synchronization between lip motion and speech, while humans are highly sensitive to subtle abnormalities in facial motion and audio-visual synchronization.

Based on a thorough review of previous work, we study talking face generation from three aspects: 3D-based talking face generation, 2D-based talking face generation, and 2D-3D combined talking face generation. The main work is as follows.

First, for 3D-based talking face generation, a 3D face model is driven by articulatory movements to generate face animations. The method consists of two parts: audio/text-to-visual conversion and 3D face modeling. In this thesis, audio/text-to-visual conversion maps audio signals or text to a sequence of articulatory movements, for which both hidden Markov models (HMMs) and deep learning are employed. With the HMM approach, we study articulatory movement prediction from text input and compare the prediction accuracy of the monophone, triphone, and fully context-dependent HMMs; extensive experiments show that the fully context-dependent HMM performs best. With the deep learning approach, a bottleneck long-term recurrent convolutional neural network (BLTRCNN) is proposed for articulatory movement prediction. In this network, bottleneck features are generated as a compact representation of the sparse linguistic features and are believed to capture information complementary to the input features; they are therefore integrated with the original linguistic features and the acoustic features for better performance. BLTRCNN also introduces skip connections, which concatenate feature maps learned at different layers to increase variation in the input of subsequent layers and improve efficiency. We further compare performance under different inputs (audio only, text only, and both text and audio). Extensive experiments show that the proposed BLTRCNN achieves state-of-the-art root-mean-square error (RMSE) when textual information and acoustic features are combined as input.
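As an illustration of this kind of articulatory movement predictor, the sketch below shows a minimal bottleneck-plus-skip-connection recurrent regressor in PyTorch. It is not the thesis implementation: the layer sizes, the two-layer MLP used as the bottleneck extractor, the single LSTM standing in for the long-term recurrent convolutional stack, and the articulatory output dimensionality are all assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the thesis code): bottleneck features are
# computed from sparse linguistic features, concatenated with the original
# linguistic and acoustic features, passed through an LSTM, and a skip
# connection feeds the fused input directly to the regression head.
import torch
import torch.nn as nn

class BottleneckArticulatoryPredictor(nn.Module):
    def __init__(self, ling_dim=600, acou_dim=40, bottleneck_dim=64,
                 hidden_dim=256, art_dim=18):
        super().__init__()
        # Compress the sparse linguistic features into a compact bottleneck code.
        self.bottleneck = nn.Sequential(
            nn.Linear(ling_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim), nn.ReLU(),
        )
        in_dim = ling_dim + acou_dim + bottleneck_dim
        # Recurrent layer over the fused per-frame features.
        self.rnn = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        # Skip connection: the head sees both the LSTM output and its own input.
        self.head = nn.Linear(hidden_dim + in_dim, art_dim)

    def forward(self, ling, acou):
        # ling: (B, T, ling_dim) linguistic features; acou: (B, T, acou_dim) acoustics
        bn = self.bottleneck(ling)                    # (B, T, bottleneck_dim)
        x = torch.cat([ling, acou, bn], dim=-1)       # fused input features
        h, _ = self.rnn(x)                            # (B, T, hidden_dim)
        return self.head(torch.cat([h, x], dim=-1))   # (B, T, art_dim) trajectories
```

Training such a model would minimize the mean-squared error between predicted and measured articulatory trajectories; the RMSE reported above is simply the square root of that error on held-out data.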
Second, for 2D-based talking face generation, a deep learning-based method is adopted to synthesize high-resolution, lip-synced face animations of arbitrary identities given audio or text as input. The method is split into two parts: mouth shape prediction and video generation. For mouth shape prediction, a time-delayed long short-term memory (LSTM) network is adopted; through its time-delay steps, the network makes full use of past information while also exploiting future context, which markedly improves the quality of the results. For video generation, the Face2Vid network is proposed to generate lip-synced face animations. In Face2Vid, optical flow is introduced to model the temporal dependency between adjacent frames, yielding temporally coherent videos and smooth transitions of facial movements. A self-attention mechanism is also employed to model spatial dependency and capture global, long-range dependencies across facial images. The proposed method is composed of fully trainable neural modules and exhibits strong generalization in generating the heads of different persons.

Third, 2D-3D combined talking face generation not only preserves the operability of the model and generates face animations with natural head poses, but also retains the details of the image texture and synthesizes photo-realistic face animations. For this task, we propose a method based on 3D face modeling and 2D video synthesis, which consists of three parts: (1) estimating the 3D face shape from a single image of a person, using the RingNet architecture, which is robust and produces similar shapes from images of the same subject and different shapes for different subjects; (2) given an audio clip and the 3D face model, generating lip-synced 3D face animations with the VOCA architecture; and (3) given a target video and a 3D face animation, generating a sequence of face sketches and then synthesizing photo-realistic face animations with the proposed video synthesis network.
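To make the time-delay idea used for mouth shape prediction in the 2D pipeline concrete, the sketch below delays the LSTM output by a few frames so that each prediction can also see a short window of future audio. It is only a sketch under assumed feature sizes and delay; the thesis network, its dimensions, and its training details are not reproduced here.

```python
# Minimal sketch (assumptions, not the thesis code): a time-delayed LSTM for
# mouth shape prediction. The hidden state at audio frame t + D predicts the
# mouth shape of frame t, so every prediction has seen D frames of future context.
import torch
import torch.nn as nn

class TimeDelayedLSTM(nn.Module):
    def __init__(self, audio_dim=28, hidden_dim=128, mouth_dim=40, delay=3):
        super().__init__()
        self.delay = delay
        self.rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, mouth_dim)

    def forward(self, audio):
        # audio: (B, T, audio_dim). Repeat the last frame D times so the
        # network still has input while emitting the final predictions.
        pad = audio[:, -1:, :].repeat(1, self.delay, 1)
        h, _ = self.rnn(torch.cat([audio, pad], dim=1))   # (B, T + D, hidden_dim)
        # Discard the first D states; the output for frame t is produced at step t + D.
        return self.out(h[:, self.delay:, :])             # (B, T, mouth_dim)
```

The predicted mouth shapes would then condition a video generator such as the Face2Vid network described above.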
Keywords/Search Tags: Talking Face Generation, Articulatory Movement Prediction, 3D Face Model, Video Synthesis, Multimodal Inputs, Adversarial Training