
Design And Implementation Of Speech-driven Facial Video Synthesis System

Posted on: 2021-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2518306023975299
Subject: Computer technology

Abstract/Summary:
Research in human evolutionary psychology shows that people obtain more information from the dual-modal input of a speech sequence and facial animation than from either modality alone, and understand that information more effectively. Speech animation is a technology that synthesizes facial animation consistent and synchronized with a speech sequence. It has wide application value in human-computer interaction, movies, games, and related fields, and is one of the core technical foundations for generating the facial expressions and animations of virtual characters. This dissertation studies the mapping model between facial features and speech feature parameters in speech animation, and designs and implements a speech-driven facial video synthesis system.

Firstly, a mapping model between facial features and speech feature parameters based on the deep network Bi-LSTM is proposed. The model is trained on synchronized audio-video dual-modal data to learn the mapping from the MFCC speech feature parameters to the CLM facial feature landmarks. Secondly, a speech-driven facial animation generation algorithm is proposed. The algorithm obtains predicted facial feature landmarks for the driving speech from the trained mapping model, and then combines affine transformation and video coding technology to generate the facial animation video.

The experiments used about 1000 minutes of weekly radio address video clips from Obama's presidency as the training corpus. The mapping-model experiments show that the Bi-LSTM-based model proposed in this dissertation significantly outperforms a unidirectional LSTM; after parameter tuning, the prediction accuracy reached 89.5%. The facial animation generation experiments show that the synthesized video is natural and smooth, with a frame rate of 100 fps. For the same driving speech input, the average SSE, the objective evaluation criterion, reached 9.19; the subjective scores for fluency and fidelity of the generated videos were 7.84 and 8.98, respectively, out of a full mark of 10.

Finally, based on the aforementioned mapping model and facial animation synthesis method, a speech-driven facial video synthesis system with a B/S architecture is designed and implemented. The system is easy to operate and can synthesize natural, synchronized facial video output for arbitrary driving speech.
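The abstract does not give implementation details of the mapping model. As an illustration only, a minimal PyTorch sketch of the kind of Bi-LSTM described (a sequence of MFCC frames in, landmark coordinates out) might look as follows; the dimensions (13 MFCC coefficients, 68 CLM landmarks flattened to 136 coordinates) and all names are assumptions, not taken from the dissertation:

```python
import torch
import torch.nn as nn

class MfccToLandmarks(nn.Module):
    """Bi-LSTM mapping a sequence of MFCC frames to facial landmark coordinates.
    Dimensions (13 MFCCs, 68 landmarks -> 136 coords) are illustrative assumptions."""
    def __init__(self, n_mfcc=13, hidden=128, n_coords=136):
        super().__init__()
        # A bidirectional LSTM reads the audio feature sequence in both directions,
        # so each frame's prediction can use past and future acoustic context.
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        # A linear head projects the concatenated forward/backward states to landmarks.
        self.head = nn.Linear(2 * hidden, n_coords)

    def forward(self, mfcc):          # mfcc: (batch, frames, n_mfcc)
        out, _ = self.lstm(mfcc)      # out: (batch, frames, 2 * hidden)
        return self.head(out)         # (batch, frames, n_coords)
```

Training such a model would minimize a regression loss (e.g. MSE) between the predicted landmark trajectories and ground-truth landmarks extracted from the synchronized video frames.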
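The affine-transformation step of the animation generation algorithm can be sketched as follows. This is a minimal NumPy example (function names are hypothetical, not from the dissertation) that estimates, by least squares, a 2×3 affine matrix mapping a reference frame's landmarks onto the predicted landmarks; such a matrix could then drive warping of the corresponding face region:

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine matrix mapping src (N, 2) landmarks onto dst (N, 2)."""
    n = src.shape[0]
    # Homogeneous coordinates: each row is [x, y, 1].
    A = np.hstack([src, np.ones((n, 1))])
    # Solve A @ X = dst for X (3, 2); transpose to the conventional (2, 3) layout.
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return X.T

def apply_affine(M, pts):
    """Apply a 2x3 affine matrix M to (N, 2) points."""
    return pts @ M[:, :2].T + M[:, 2]
```

With at least three non-collinear landmark pairs the system is well determined; applying the estimated matrix per frame and encoding the warped frames yields the output video.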
Keywords/Search Tags: speech animation, face animation synthesis, speech-driven, Bi-LSTM, video synthesis