
Research On Speech Driven Talking Face Video Generation

Posted on: 2022-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: W T Wang
GTID: 2518306542466684
Subject: Control Engineering
Abstract/Summary:
Speech-driven talking face video generation aims to generate a talking face video from arbitrary audio and an arbitrary head image of a given person. The technology has been widely used in film production, virtual news broadcasting, virtual speech, and similar applications. Current research on speech-driven talking face video generation focuses mainly on the quality of facial synthesis and the accuracy of lip movement, but neglects the speaker's head motion. Moreover, when facial landmarks are used as intermediate variables, the structural similarity of the key points is one of the main factors affecting the accuracy of lip movement. In previous studies, the details of the synthesized head motion were unsatisfactory. To address these problems, this thesis proposes a method that uses facial landmarks as intermediate variables to generate natural head movements, accurate lip movements, and high-quality facial expressions. Quantitative and qualitative results show that the proposed method can synthesize clear, natural talking face video with head motion, and that it outperforms existing methods.

The main contents and innovations of the thesis are as follows:

First, a speech-splitting neural network with facial landmarks as intermediate variables is studied. The speech information is decomposed into head motion information and semantic information by a convolutional neural network and a recurrent neural network. By separating the head landmarks from the lip landmarks, the head motion information and the semantic information in the input speech are mapped to the face contour landmarks and the lip contour landmarks, respectively.

Second, a loss function is studied to improve the accuracy of the facial landmarks. The function dynamically adjusts the landmark loss during training. This addresses the underfitting caused by the similarity of key points in the face structure, while ensuring that the network can still be trained stably.

Third, a talking face video generation network is studied, which synthesizes face video from continuous lip landmark sequences, head landmark sequences, and template images. On this basis, a channel attention mechanism is introduced so that the network can capture the head motion information and the semantic information of the lip landmarks more accurately.
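
The abstract does not say which form of channel attention the generation network uses. As an illustration only, the sketch below shows one common form, a squeeze-and-excitation style gate, in plain Python. The function name `channel_attention`, the weight matrices `w1` and `w2`, and the toy feature map are all hypothetical and are not taken from the thesis.

```python
import math


def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style channel attention (illustrative sketch).

    features: list of C channels, each an H x W list of lists of floats.
    w1: hidden x C weight matrix (hypothetical, would be learned in practice).
    w2: C x hidden weight matrix (hypothetical, would be learned in practice).
    Returns the re-weighted feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features
    ]
    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: fully connected layer followed by sigmoid,
    # producing one gate in (0, 1) per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, features)]
    return scaled, gates


# Toy input: two 2 x 2 channels (values chosen only for illustration).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w1 = [[-1.0, 1.0]]      # one hidden unit
w2 = [[1.0], [-1.0]]    # one gate per channel

scaled, gates = channel_attention(features, w1, w2)
```

With these toy weights the first channel receives a larger gate than the second, so its values are attenuated less. In a real network the weights would be learned, which lets the model emphasize the feature channels that carry the most useful landmark information.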
Keywords/Search Tags: Talking Face, Facial Landmark, Lip Motion, Head Motion, Face Video