
Acoustically-driven talking face animations using dynamic Bayesian networks

Posted on: 2009-12-14
Degree: Ph.D
Type: Thesis
University: University of California, Los Angeles
Candidate: Xue, Jianxia
Subject: Engineering
GTID: 2448390002995462
Abstract/Summary:
Visual speech information on a speaker's face is important for improving the robustness and naturalness of both human and machine speech comprehension. Natural and intelligible talking face animations can benefit a broad range of applications such as digital effects, computer animations, computer games, computer-based tutoring, and scientific studies of human speech perception. This study focuses on developing an acoustically-driven talking face animation system. Acoustical speech signals are found to be highly correlated with visual speech signals, and thus can be used effectively to drive facial animations.

The acoustically-driven talking face animation system is developed using an audio-visual speech database. The database used in this study includes a previous recording (CorpusA), a pilot diphone-oriented recording (CorpusB), and a new recording (CorpusC). The raw optical data from the new recording are processed through an archiving pipeline. Acoustical and optical data are first segmented into tokens, and then acoustical data are segmented into phonemes through HMM forced alignment.

Dynamic Bayesian networks (DBNs) are applied to the acoustic-to-optical speech signal mapping in the acoustically-driven talking face animation system. Different DBN structures and model selection parameters are studied. Experimental results show that the state-dependent structures in the DBN models yield high correlation between reconstructed and recorded facial motions. More interestingly, the maximum inter-chain state asynchrony parameter of the DBN configurations has a greater effect on synthesis accuracy than the number of hidden states in the audio and visual Markov chains. This study demonstrates the potential of DBNs in acoustically-driven talking face synthesis.

An optical data-driven animation rendering tool is built based on radial basis functions. Synthesized optical data and recorded optical data are both used to generate animations for system evaluation. A lexicon distinction identification test is conducted with 16 human subjects. Perceptual test results on original optical data-driven animations show that the radial basis function algorithm provides highly natural rendering of talking faces. Perceptual test results on synthesized optical data-driven animations show that, for some words, the synthesized results yield lexicon distinction identification scores similar to those obtained with recorded data-driven animations. The formal perceptual test provides a quantitative evaluation of the entire acoustically-driven talking face animation system, which can be very useful for future system tuning and improvement.
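The inter-chain asynchrony parameter mentioned above can be pictured as a limit on how far the audio and visual Markov chains of the DBN may drift apart at any frame. The following is a minimal illustrative sketch, not the thesis implementation: it only enumerates which joint (audio state, visual state) pairs remain admissible under a given asynchrony limit, and the function name and state counts are assumptions for the example.

```python
# Sketch: joint state space of a two-chain DBN under a maximum
# inter-chain state asynchrony constraint (illustrative only).
from itertools import product

def joint_states(n_audio_states, n_visual_states, max_asynchrony):
    """Return the (audio, visual) state pairs allowed when the two
    Markov chains may drift apart by at most `max_asynchrony` states."""
    return [(a, v)
            for a, v in product(range(n_audio_states), range(n_visual_states))
            if abs(a - v) <= max_asynchrony]

if __name__ == "__main__":
    # Example: 3 audio states and 3 visual states per phone-level model.
    for async_limit in (0, 1, 2):
        states = joint_states(3, 3, async_limit)
        print(f"max asynchrony {async_limit}: {len(states)} joint states")
```

Tightening the limit toward zero forces the two chains to move in lock-step, while relaxing it lets visual articulation lead or lag the audio, which is one way to read the finding that this parameter matters more than the number of hidden states per chain.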
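The objective evaluation cited above is a correlation between reconstructed and recorded facial motions. A small sketch of that kind of measure, under the assumption that each motion sequence is stored as a frames-by-channels array (one channel per marker coordinate), is given below; the array names are illustrative.

```python
# Sketch: mean Pearson correlation between synthesized and recorded
# facial motion trajectories (illustrative evaluation measure).
import numpy as np

def mean_trajectory_correlation(synthesized, recorded):
    """Per-channel Pearson correlation, averaged over channels.

    Both arrays have shape (n_frames, n_channels)."""
    s = synthesized - synthesized.mean(axis=0)
    r = recorded - recorded.mean(axis=0)
    corr = (s * r).sum(axis=0) / (
        np.linalg.norm(s, axis=0) * np.linalg.norm(r, axis=0))
    return corr.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    recorded = rng.normal(size=(200, 60))          # 200 frames, 20 markers x 3 coords
    synthesized = recorded + rng.normal(0, 0.5, size=recorded.shape)
    print(mean_trajectory_correlation(synthesized, recorded))
```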
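The rendering tool interpolates sparse optical marker motion over a dense face mesh with radial basis functions. The sketch below shows one common form of such a scheme, assuming Gaussian kernels and exact interpolation of marker displacements; the function, the kernel width, and the marker/vertex arrays are assumptions for illustration, not the thesis code.

```python
# Sketch: RBF-based mesh deformation driven by optical marker data
# (Gaussian kernels, exact interpolation of marker displacements).
import numpy as np

def rbf_deform(neutral_markers, target_markers, mesh_vertices, sigma=30.0):
    """Deform mesh vertices so markers move from neutral to target positions."""
    def kernel(a, b):
        # Pairwise Gaussian RBF between point sets of shape (n, 3) and (m, 3).
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    # Solve for weights that reproduce the marker displacements exactly.
    displacements = target_markers - neutral_markers               # (n, 3)
    weights = np.linalg.solve(
        kernel(neutral_markers, neutral_markers)
        + 1e-8 * np.eye(len(neutral_markers)),
        displacements)                                             # (n, 3)

    # Apply the interpolated displacement field to every mesh vertex.
    return mesh_vertices + kernel(mesh_vertices, neutral_markers) @ weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    markers0 = rng.uniform(-50, 50, size=(20, 3))        # neutral marker positions
    markers1 = markers0 + rng.normal(0, 2, size=(20, 3)) # one animation frame
    mesh = rng.uniform(-60, 60, size=(500, 3))           # dense face mesh vertices
    print(rbf_deform(markers0, markers1, mesh).shape)    # (500, 3)
```

The same routine can be driven either by recorded optical frames or by frames synthesized from the DBN mapping, which is how both animation conditions in the perceptual test can share one rendering path.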
Keywords/Search Tags: Acoustically-driven talking face, Speech, Using, Optical data