
A Study On Speech Driven Human Face Modeling And Animation

Posted on: 2012-03-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Li
Full Text: PDF
GTID: 1118330362960463
Subject: Military communications science
Abstract/Summary:
Speech-driven human face modeling and animation technology belongs to the field of visual speech synthesis. The approach first constructs a 3D face model suitable for animation from facial information, and then produces the corresponding lip animation from a given speech signal, which improves the audience's comprehension of the speech. Owing to its simplicity and practicality, the technology is of theoretical and practical importance for 3D game production, film dubbing, network media attack and countermeasure, distance education, and visual communication.

This dissertation takes face modeling and speech-driven animation as its principal research subject. It first proposes a 3D face modeling approach; it then controls lip animation on this model by extracting lip-movement parameters and building a realistic lip animation model; and it finally analyzes the input speech to extract the speech features that drive the corresponding lip animation. The main contents and innovations of the dissertation are summarized as follows:

A refined multi-template ASM algorithm is proposed, which performs global localization and local localization with their respective templates in order. In local localization, a narrow strip map is first constructed around the feature points of each template; the closed-form image segmentation algorithm is then employed to perform texture segmentation on the strip map; and finally the local templates are matched to the image to obtain feature-point information. Experimental results show that the improved algorithm effectively handles the inaccuracy of the traditional ASM algorithm when locating feature points in texture-smooth regions, and that it enhances the detection accuracy of all feature points.

The traditional Mean-Shift algorithm is improved to perform lip tracking and detection. By combining a target boundary-region likelihood with a Level Set model, the algorithm adjusts the size of the search window adaptively and obtains the motion information of a speaker's lips. Sub-regions are placed around the center of the search window in the Level Set model, and the likelihood between these sub-regions and the lip boundary is used for lip detection. This method extracts contours more accurately than a Level Set model that uses gradient information alone. The combination of ASM and the target boundary-region likelihood ensures a more robust extraction of the outer lip contour and supplies data for lip animation.

An approach incorporating a muscle model with MPEG-4 facial animation is proposed; a simplified sketch of muscle-driven deformation appears after this paragraph. Skin points, skeleton points, and the controlling regions of the muscles are defined on the Candide-3 face model. The skeleton points control the movements of the lip feature points. For non-feature points inside a muscle's controlling region, the muscle model adjusts their positions; for non-feature points outside the controlling regions, the lip animation definition tables are used instead. Applying the Loop subdivision method together with the simplified muscle model makes the animation both finer and more efficient.
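As an illustration only, here is a minimal Python sketch of a Waters-style linear muscle deformation of the kind a simplified muscle model suggests; the cone-of-influence falloff, the parameter values, and the LinearMuscle class are assumptions made for this sketch, not the dissertation's actual formulation.

```python
import numpy as np

class LinearMuscle:
    """Simplified Waters-style linear muscle: vertices inside a cone of
    influence are pulled toward the muscle's fixed attachment point."""

    def __init__(self, attach, insert, radius, angle_deg):
        self.attach = np.asarray(attach, float)   # fixed end ("skull" side)
        self.insert = np.asarray(insert, float)   # mobile end in the skin
        self.radius = radius                      # length of zone of influence
        self.cos_limit = np.cos(np.radians(angle_deg))
        axis = self.insert - self.attach
        self.axis = axis / np.linalg.norm(axis)

    def displace(self, vertices, contraction):
        """Return displaced copies of `vertices` for a contraction in [0, 1]."""
        v = np.asarray(vertices, float) - self.attach
        dist = np.linalg.norm(v, axis=1) + 1e-9
        cos_theta = (v @ self.axis) / dist
        # Angular falloff: zero outside the cone, strongest along the axis.
        ang = np.clip((cos_theta - self.cos_limit) / (1.0 - self.cos_limit), 0.0, 1.0)
        # Radial falloff: influence decays toward the edge of the zone.
        rad = np.clip(1.0 - dist / self.radius, 0.0, 1.0)
        w = contraction * ang * rad
        # Pull each affected vertex toward the attachment point.
        return np.asarray(vertices, float) - w[:, None] * v

# Usage: contract a hypothetical lip-corner muscle on a toy vertex set.
muscle = LinearMuscle(attach=[0, 0, 0], insert=[0, 0, 2], radius=3.0, angle_deg=40)
verts = np.array([[0.2, 0.1, 1.5], [0.0, 0.5, 2.5], [2.0, 2.0, 0.1]])
print(muscle.displace(verts, contraction=0.5))
```

In a full pipeline, only the non-feature points inside each muscle's controlling region would be passed through such a displacement, while points outside it would be driven by the animation definition tables, as the paragraph above describes.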
Experimental results demonstrate that this control method effectively produces more realistic lip movements in animation.

An initial/final segmentation method is proposed that establishes a loss function and exploits the periodicity of voiced sounds and the typical durations of initials and finals. The method first computes the autocorrelation function of the speech; it then establishes the loss function and performs voiced-sound detection on the results by dynamic programming; next it determines the detection scope of the initials according to their distribution rules; and finally it segments initials from finals, within the detection scope, at the two parts abutting the leading edge of the voiced region, using an auditory event detection method. Experimental results show that segmentation accuracy improves because segmentation is performed on the voiced regions, that the impact of noise and of sound changes in Chinese is reduced, and that the accuracy of speech-driven animation is thereby promoted.

A dynamic Chinese viseme model is proposed. Since Chinese is a syllabic language whose pronunciation process bears a "rugby-ball" (spindle-like) characteristic, the model handles intra-syllable and inter-syllable effects separately. Within a syllable, a lip sub-movement model based on initials and finals is used: it first extracts the lip feature parameters of initials and finals and obtains a simplified viseme model by categorizing mouth shapes according to these parameters; it then computes the mouth-shape likelihood between the lip sub-movements and the pronunciation process of syllables; and it finally constructs a parametric model of lip movements in which a small number of parameters control the mouth shapes of Chinese pronunciation. Between syllables, a weighting function graded by vowel influence is used to simulate co-articulation; a sketch of this blending step follows. Experimental results show that, compared with Chinese visemes described by phoneme or triphone models, the method improves animation efficiency and strikes a balance between the realism of the lip animation and its running speed.
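As an illustration of the inter-syllable weighting idea, the following minimal Python sketch blends per-syllable mouth-shape targets with a vowel-graded, exponentially decaying weight; the dominance table, the exponential falloff, and all names here are assumptions made for this sketch rather than the dissertation's actual weighting function.

```python
import numpy as np

# Hypothetical dominance grades: stronger vowels influence neighbors more.
VOWEL_DOMINANCE = {"a": 1.0, "o": 0.9, "e": 0.8, "i": 0.6, "u": 0.6}

def blend_visemes(targets, times, t, falloff=8.0):
    """Blend per-syllable viseme targets at time `t` (seconds).

    targets : list of (vowel, mouth-shape parameter vector) pairs
    times   : list of syllable-center timestamps, aligned with `targets`
    falloff : how quickly a syllable's influence decays with time distance
    """
    weights, shapes = [], []
    for (vowel, shape), tc in zip(targets, times):
        dom = VOWEL_DOMINANCE.get(vowel, 0.5)
        # Influence decays exponentially away from the syllable center,
        # scaled by the vowel's dominance grade (simulating co-articulation).
        weights.append(dom * np.exp(-falloff * abs(t - tc)))
        shapes.append(shape)
    weights = np.asarray(weights)
    weights /= weights.sum() + 1e-9        # normalize to a convex blend
    return np.tensordot(weights, np.stack(shapes), axes=1)

# Usage: two syllables ("ma", "mi") with toy 3-parameter mouth shapes.
targets = [("a", np.array([1.0, 0.8, 0.2])), ("i", np.array([0.2, 0.1, 0.9]))]
print(blend_visemes(targets, times=[0.0, 0.3], t=0.15))
```

The normalized, distance-decayed weights make each frame's mouth shape a convex combination of neighboring syllable targets, so a dominant vowel such as /a/ visibly shapes the transition into and out of its syllable.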
Keywords/Search Tags: speech-driven, face modeling and animating, active shape model, closed-form algorithm, boundary region likelihood, muscle model, initial/final segmentation, dynamic viseme, lip sub-movement