
Research On Data-driven 3D Facial Animation

Posted on: 2016-01-01    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C W Luo    Full Text: PDF
GTID: 1108330473961623    Subject: Control Science and Engineering
Abstract/Summary:
Facial animation is an active research topic in both academia and industry, with applications in the movie industry, computer games, human-computer interaction, and teleconferencing. 3D modeling of human faces is the basis of facial animation, because the realism of the generated animation relies heavily on the realism of the constructed face model. Once a 3D face model has been obtained, animating it is a further important issue. Based on a thorough review of previous work, we carry out research on 3D face modeling and animation. Specifically, our work covers the following aspects.

With respect to face modeling, we focus on image-based modeling techniques. We propose a method for reconstructing a 3D face model from a single image. First, we locate the facial feature points in the input image. Next, a statistical face model based on principal component analysis (PCA) is fitted to the detected feature points. Finally, radial basis function interpolation is used to refine the constructed face model. This method can construct a personalized 3D face model in a few seconds.

In general, depth estimation from a single image is unreliable, so we further propose a face modeling method based on binocular stereo vision. Because salient facial features are sparse, traditional stereo matching algorithms based on texture correlation cannot obtain reliable dense correspondences for face stereo images. Instead of dense stereo matching, we first match the salient facial features of the stereo pair and recover their sparse 3D point cloud. Next, an initial face model is constructed by fitting a PCA face model to the point cloud. Then the vertices of the initial model are projected onto the left and right images, and the color difference between the corresponding projections is computed. Finally, the initial face model is iteratively refined according to the 3D point cloud and the color difference.

With respect to animation synthesis, we focus on data-driven 3D facial animation, including performance-driven, speech-driven and text-driven facial animation, as well as tongue animation.

Performance-driven facial animation animates a face model according to the facial actions of a user (the performer). Many existing performance-based approaches can generate highly realistic facial animations, but they typically require facial markers or special equipment to capture the performance. In this dissertation, we propose a performance-driven facial animation system for ordinary users, which lets a user animate an avatar by performing the desired facial motions in front of a video camera. In our system, a constrained local model (CLM) is used to track the feature points in the video. A CLM uses only local texture and performs an exhaustive local search around the current estimate of each feature point, which often leads to local minima. To improve tracking accuracy, we incorporate the global texture of an active appearance model (AAM) into the CLM. The improved CLM not only gives each landmark discriminative power but also enforces a good match to the whole face texture, leading to better tracking accuracy. After obtaining the 2D positions of the feature points, we estimate blendshape coefficients from a set of user-specific 3D key shapes.
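The abstract does not detail how the blendshape coefficients are computed from the tracked 2D feature points; the following minimal Python sketch illustrates one common formulation, a non-negative least-squares fit of the weights under a linear camera projection. The names used here (estimate_blendshape_weights, project, and so on) are illustrative assumptions, not identifiers from the dissertation.

    import numpy as np
    from scipy.optimize import nnls

    def estimate_blendshape_weights(landmarks_2d, neutral_3d, key_shapes_3d, project):
        """Fit blendshape weights so the projected model matches tracked landmarks.

        landmarks_2d  : (L, 2) feature points tracked by the CLM in one frame
        neutral_3d    : (L, 3) corresponding vertices of the neutral key shape
        key_shapes_3d : (K, L, 3) corresponding vertices of K user-specific key shapes
        project       : function mapping (L, 3) points to (L, 2) image points,
                        assumed linear (e.g. a weak-perspective camera)
        """
        # Residual of the neutral face against the observed landmarks.
        target = (landmarks_2d - project(neutral_3d)).ravel()

        # Each column holds the projected displacement contributed by one key shape.
        basis = np.stack(
            [(project(shape) - project(neutral_3d)).ravel() for shape in key_shapes_3d],
            axis=1,
        )

        # Non-negative least squares keeps the weights physically plausible.
        weights, _ = nnls(basis, target)
        return np.clip(weights, 0.0, 1.0)

In a full system the per-frame weights would also be smoothed over time; the sketch only covers a single frame.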
Finally, facial animations are created by blendshape interpolation. To further improve face tracking, we propose a tracking method based on a probabilistic random forest (PRF). The PRF is identical to a standard random forest except that it models the probability of a sample reaching each leaf node. Experiments show that our method significantly outperforms the state of the art in terms of fitting accuracy.

Speech-driven facial animation synthesis animates a virtual face so that the synthetic lip movements match an input audio signal. Audio-to-visual conversion is the core of speech-driven facial animation. In this dissertation, Gaussian mixture models (GMMs) are employed for audio-to-visual conversion. The conventional GMM-based method performs the conversion frame by frame using minimum mean square error estimation. Although reasonably effective, this conversion is not accurate enough, and discontinuities often appear in the estimated target sequences because each frame is converted independently and the correlations between frames are ignored. We propose incorporating the previous visual feature into the conversion, so that each estimated visual feature depends not only on the current audio feature but also on the previous visual feature. Experiments show that this greatly improves the conversion accuracy. In our system, Mel-frequency cepstral coefficients are used as audio features and the positions of the facial feature points are used as visual features. After predicting the feature point positions from speech, we synthesize the facial animation by blendshape interpolation.

Text-driven facial animation is also known as text-to-visual speech synthesis. Given a phoneme sequence, we predict the visual feature sequence using a hidden Markov model (HMM) based framework, and the predicted sequence is then used to drive the virtual face.

In most facial animation systems the tongue is not accurately modeled, which makes the resulting animations less convincing, even though the tongue is the most important speech organ. Modeling the tongue is useful not only for improving the intelligibility of synthesized speech animation, but also for speech production research. We reconstruct an accurate 3D tongue model from magnetic resonance images. The model is a triangular mesh with 254 vertices and 504 triangular faces, and its deformation is controlled by three control vertices located at the tongue tip, tongue body and tongue dorsum. To create highly realistic text-driven tongue animation, we record the movements of the tongue tip, tongue body and tongue dorsum of a speaker using electromagnetic articulography (EMA). The articulatory movements are used to train a set of HMMs, from which we predict the movements of the tongue tip, tongue body and tongue dorsum from text. The predicted movements determine the positions of the three control vertices, and the positions of the remaining vertices are determined by minimizing a deformation energy. The tongue model can achieve a variety of tongue shapes while preserving volume. Experiments show that the generated tongue animations are realistic.
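The abstract states that the remaining tongue vertices are obtained by minimizing a deformation energy subject to the three control vertices, but does not give the energy itself. The sketch below shows one plausible quadratic formulation, preserving rest-pose edge vectors with soft positional constraints on the controls, solved as a linear least-squares problem. The function name, the uniform edge energy and the constraint weight w are assumptions, and the dissertation's volume-preservation behavior is not reproduced here.

    import numpy as np

    def deform_with_controls(vertices, edges, control_ids, control_targets, w=10.0):
        """Deform a mesh by minimizing a quadratic deformation energy.

        vertices        : (N, 3) rest positions of the tongue mesh
        edges           : list of (i, j) vertex-index pairs forming the mesh edges
        control_ids     : indices of the control vertices (tip, body, dorsum)
        control_targets : (C, 3) target positions, e.g. predicted from the HMMs
        w               : weight of the soft positional constraints (assumed value)
        """
        n = len(vertices)
        rows, rhs = [], []

        # Edge terms: keep each deformed edge vector close to its rest-pose value.
        for i, j in edges:
            r = np.zeros(n)
            r[i], r[j] = 1.0, -1.0
            rows.append(r)
            rhs.append(vertices[i] - vertices[j])

        # Soft constraints pulling each control vertex to its predicted target.
        for cid, target in zip(control_ids, control_targets):
            r = np.zeros(n)
            r[cid] = w
            rows.append(r)
            rhs.append(w * np.asarray(target))

        A = np.asarray(rows)                     # (E + C, N)
        b = np.asarray(rhs)                      # (E + C, 3)
        new_vertices, *_ = np.linalg.lstsq(A, b, rcond=None)
        return new_vertices                      # (N, 3) deformed positions

For the actual 254-vertex tongue mesh a sparse solver and an explicit volume term would be preferable; the dense solve above is only meant to show the structure of the optimization.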
Keywords/Search Tags: face modeling, facial animation, tongue animation, stereo vision, data-driven, performance-driven, speech-driven, text-driven, facial feature tracking, constrained local model, probabilistic random forest, Gaussian mixture model