
The Study Of Speech-driven Talking Avatar

Posted on: 2018-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z Tang
Full Text: PDF
GTID: 2348330515469126
Subject: Control theory and control engineering
Abstract/Summary:
The speech-driven talking avatar technique synthesizes corresponding facial animation from input speech. It not only helps users understand the speech, but also provides a realistic and friendly way of interacting with a computer, and as the technology develops it will bring new interactive experiences. In this thesis, two schemes for speech-driven talking avatar animation synthesis are studied. The first is speech-driven articulator motion synthesis with deep neural networks; the second is speech-driven talking avatar animation synthesis based on MPEG-4. The results of the two schemes are compared and analyzed. Both schemes require a corresponding corpus, so a suitable audio-visual database is extracted and constructed to address the problems in this thesis.

In the first scheme, we exploit the direct connection between speech production and the motion of the articulators, such as the positions and movements of the lips, tongue and soft palate. Acoustic-to-articulatory mapping is realized with a deep neural network: the input of the system is acoustic speech and the output is the estimated articulatory movement on a three-dimensional avatar. First, we compare the performance of an artificial neural network (ANN) and a deep neural network (DNN) under a series of parameter settings to obtain the better network. Second, for each acoustic context length configuration, the number of hidden-layer units is tuned for best performance, which yields the best context length. Finally, we select the optimal network structure and realize the avatar animation by using the articulatory motion trajectories output by the DNN to control articulator motion synthesis.

The second scheme, speech-driven talking avatar animation synthesis based on MPEG-4, is a data-driven method. First, we extract and construct a suitable audio-visual corpus from the LIPS2008 database. Then a back-propagation (BP) network is used to learn the mapping from acoustic feature parameters to facial animation parameters (FAPs). Finally, the system controls the facial model to synthesize lip animation with the FAP sequence predicted by the BP network.

We compare the animation synthesized by the two schemes through objective and subjective evaluation. The evaluations show that both schemes can vividly and efficiently realize talking avatar animation synthesis. Comparing the two schemes, the first requires constructing a suitable lip model; although it achieves high accuracy, it does not generalize well and its corpus is not easy to acquire. The second scheme conforms to the MPEG-4 standard and controls the facial model with FAP sequences, so it is more universal and extensible and better suited to wide use.
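Both schemes reduce to a frame-wise regression from contextual acoustic features to visual parameters: articulatory trajectories in the first scheme, MPEG-4 FAPs in the second. The sketch below is an illustrative assumption of how such a mapping network could be set up in PyTorch; the feature type (MFCCs), context window, layer sizes and output dimension are hypothetical placeholders, not the configurations or tools used in the thesis.

```python
# Hypothetical sketch (not from the thesis): a feed-forward DNN that maps a
# window of acoustic frames to visual parameters for the centre frame.
# In scheme 1 the targets would be 3-D articulator coordinates; in scheme 2
# they would be MPEG-4 FAPs.  All dimensions below are illustrative.
import torch
import torch.nn as nn

N_MFCC = 13     # assumed acoustic features per frame (e.g. MFCCs)
CONTEXT = 11    # assumed context window: current frame plus 5 on each side
N_OUT = 68      # assumed output dimension (FAPs or articulator coordinates)

class AcousticToVisualDNN(nn.Module):
    """Frame-wise regression from contextual acoustic features to visual parameters."""
    def __init__(self, n_in=N_MFCC * CONTEXT, n_hidden=512, n_out=N_OUT):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_out),          # linear output for regression
        )

    def forward(self, x):
        return self.net(x)

def stack_context(frames: torch.Tensor, context: int = CONTEXT) -> torch.Tensor:
    """Concatenate each frame with its neighbours: (T, N_MFCC) -> (T, N_MFCC * context)."""
    half = context // 2
    padded = torch.cat([frames[:1].repeat(half, 1), frames, frames[-1:].repeat(half, 1)])
    return torch.cat([padded[i:i + len(frames)] for i in range(context)], dim=1)

if __name__ == "__main__":
    model = AcousticToVisualDNN()
    mse = nn.MSELoss()
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Dummy utterance: 200 acoustic frames and their target visual parameters.
    acoustic = torch.randn(200, N_MFCC)
    targets = torch.randn(200, N_OUT)

    pred = model(stack_context(acoustic))   # one training step on the utterance
    loss = mse(pred, targets)
    loss.backward()
    optim.step()
    print("training loss:", loss.item())
```

In this framing, the BP network of the second scheme is simply a shallower instance of the same regression structure, and tuning the context length corresponds to varying CONTEXT while re-tuning the hidden-layer size.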
Keywords/Search Tags:Talking avatar, Animation synthesis, Speech-driven, Deep neural network, MPEG-4