
Speech-Driven Facial Animation Synthesis Based on a Deep Learning Network Model

Posted on: 2021-03-04  Degree: Master  Type: Thesis
Country: China  Candidate: Y Yue  Full Text: PDF
GTID: 2428330614971746  Subject: Software engineering
Abstract/Summary:
The speech-driven facial animation project is a real project at the author's company. The application scenario is "face-to-face" conversation between a user and a voice assistant embedded in devices such as elevators and washing machines in the 5G era. Synthesizing facial animation from speech means processing the audio signal and generating the mouth movements and facial expressions that correspond to it. This research direction is an important topic in natural human-computer interaction and can be regarded as a form of speech-oriented translation. According to relevant findings in psychology, combined visual and auditory input improves the user's interactive experience more than auditory input alone. By receiving an animated face that matches the audio they hear, users of a voice assistant perceive the speech content better, reducing the sense of unreality and uncertainty created by pure voice input.

With these goals, the author designed a speech-driven facial animation synthesis system that uses the voice signal to drive the movement of facial feature points in a face image, producing facial animation that matches the speech. A realistic facial animation involves movements of the eyebrows, eyes, lips, teeth, and so on. The author therefore designed a compound neural network model (CNN-LSTM), combined with a conditional generative adversarial network, so that the region represented by each feature point moves plausibly as a result of the network's learning. The main research contents and innovations of this thesis are as follows:

(1) A CNN-LSTM network model was designed and implemented. Mel spectrograms of the audio serve as the training inputs, the 68 two-dimensional facial feature point coordinates serve as the training labels, and the data are fed into the CNN-LSTM network in the frame order of the video. The CNN extracts features from the Mel spectrogram under supervision, linking the spectrogram to the facial feature points, while the LSTM learns the temporal correlations of the input sequence. Given the Mel spectrogram of an audio clip from the dataset, the CNN-LSTM outputs the facial feature point image corresponding to that audio.

(2) A conditional generative adversarial network model was designed and implemented to restore the facial feature point images output by the CNN-LSTM model to realistic face images. Finally, the sequence of changing face images is combined into an animation by concatenating the static frames with FFmpeg, and the audio is merged with the frames to form a video.

In the experiments to date, the facial feature points predicted by the CNN-LSTM show a small mean deviation, the model converges quickly, and the expression feature points change noticeably between consecutive outputs. For the face images generated by the conditional generative adversarial network, after 200 training epochs the mean squared error between the generated and real images is 0.1082, which essentially restores the face image. With the neural network model designed by the author, a single audio clip and one real face photo are enough to generate a video animation whose facial movements match the speech.
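The data flow in contribution (1) can be made concrete with a short sketch. The following PyTorch model is a minimal illustration, not the thesis's actual architecture: the layer sizes, the 80 Mel bands, and the 20-step spectrogram window are assumptions chosen only to show how a CNN feature extractor can feed an LSTM that regresses 68 two-dimensional facial feature points for each video frame.

    # Minimal sketch (assumed hyperparameters, not the author's code) of a
    # CNN-LSTM mapping Mel-spectrogram windows to 68 2-D facial landmarks.
    import torch
    import torch.nn as nn

    class CNNLSTMLandmarkNet(nn.Module):
        def __init__(self, n_mels=80, hidden=256, n_landmarks=68):
            super().__init__()
            # CNN: extracts features from each Mel-spectrogram window
            # (per-frame input shape: 1 x n_mels x window_length)
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            # LSTM: models temporal correlation across video frames
            self.lstm = nn.LSTM(64 * 4 * 4, hidden, batch_first=True)
            # Regressor: 68 landmarks, 2 coordinates each
            self.fc = nn.Linear(hidden, n_landmarks * 2)

        def forward(self, x):
            # x: (batch, time, 1, n_mels, window_length)
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1))        # (b*t, 64, 4, 4)
            feats = feats.flatten(1).view(b, t, -1)  # (b, t, 1024)
            out, _ = self.lstm(feats)                # (b, t, hidden)
            return self.fc(out).view(b, t, -1, 2)    # (b, t, 68, 2)

    # Example: one clip of 25 video-aligned frames, 80 Mel bins, 20-step windows
    model = CNNLSTMLandmarkNet()
    mel_windows = torch.randn(1, 25, 1, 80, 20)
    landmarks = model(mel_windows)                   # shape: (1, 25, 68, 2)

In the full pipeline described above, each predicted landmark frame would then be passed to the conditional generative adversarial network to produce a realistic face image, and FFmpeg would concatenate the resulting frames with the audio into the final video.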
Keywords/Search Tags: Speech driven, Face animation, Video synthesis, Mel Bank Features, Facial feature point