
Research On Bimodal Emotional Chinese Speech Synthesis

Posted on: 2012-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: J Yuan
GTID: 2218330338461767
Subject: Signal and Information Processing

Abstract/Summary:
The goal of a next-generation speech synthesis system is to deliver semantic information exactly and vividly through clear, natural synthetic speech. The main task of bimodal emotional speech synthesis is to give the computer the ability to synthesize natural emotional speech together with realistic facial expressions by establishing a virtual avatar. Bimodal speech synthesis and speech recognition are core technologies for realizing human-machine interaction, and they have important application value in information processing.

The thesis concentrates on 3D face model construction and rendering, animation-driving approaches, emotional prosodic feature modeling, and speech synthesis based on the Pitch Synchronous Overlap-Add (PSOLA) algorithm.

For facial modeling, a VRML model is parsed and rendered with OpenGL. The face model consists of 7 components containing 6435 vertices and 12280 faces in total. The model used in this work is more complex than those in related research and achieves more life-like facial detail.

Two animation-driving methods are compared: parameter control and data-driven animation. Motion problems of the teeth, tongue, and throat are resolved by improving the data collection approach. In the FAP parameter control method based on MPEG-4, a radial basis function and a raised cosine function are chosen to control the mouth and the facial expression, respectively. The data-driven method, based on key-frame interpolation, uses a cubic polynomial to interpolate between key frames and then composes viseme and expression frames by vector-weighted superposition to generate continuous animation. Results show that the FAP parameter method can achieve subtle changes of expression and lip shape, while the data-driven approach can produce new expressions by fusing key frames.

To improve the naturalness of the synthetic speech, the waveform concatenation algorithm is modified: a prosody prediction unit and a PSOLA-based prosody modification unit are added.
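The data-driven animation pipeline described above (cubic interpolation between key frames, then vector-weighted superposition of viseme and expression frames) can be sketched roughly as follows. The frame representation, the specific cubic easing curve, and the blend weights are illustrative assumptions, not the thesis's actual data structures or coefficients:

```python
import numpy as np

def cubic_ease(t):
    """Cubic polynomial 3t^2 - 2t^3: zero velocity at both endpoints,
    so motion starts and stops smoothly between key frames."""
    return 3 * t**2 - 2 * t**3

def interpolate_keyframes(frame_a, frame_b, n_steps):
    """Generate intermediate frames between two key frames along a cubic curve."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return [frame_a + cubic_ease(t) * (frame_b - frame_a) for t in ts]

def blend(viseme_frame, expression_frame, w_viseme, w_expression):
    """Vector-weighted superposition of a viseme frame and an expression frame."""
    return w_viseme * viseme_frame + w_expression * expression_frame

# Toy 3-vertex "face" stored as displacements from neutral (hypothetical data).
neutral = np.zeros(3)
mouth_open = np.array([0.0, 1.0, 0.0])   # viseme key frame
smile = np.array([0.5, 0.0, 0.5])        # expression key frame

target = blend(mouth_open, smile, 0.7, 0.3)
frames = interpolate_keyframes(neutral, target, 5)
```

A real system would apply the same interpolation per-vertex over the full 6435-vertex mesh; the mechanics are identical.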
In the synthesis stage, units are selected by combining a decision tree with a cost function. Simulation results show that the synthetic speech conveys the intended emotion with a natural voice. The thesis realizes a bimodal Mandarin emotional TTS system that meets real-time animation requirements despite the large data volume; the synthesized speech expresses emotional information accurately and vividly in both the visual and audio channels.
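Cost-function unit selection of this kind is typically solved with dynamic programming: each target position keeps the cheapest cumulative cost of ending at each candidate unit, summing a target cost and a concatenation cost. The sketch below is a generic version of that search with made-up toy costs; the thesis's actual decision-tree pre-selection and cost definitions are not specified here:

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi-style unit selection.

    candidates[i] is the list of candidate units for target i (e.g. units
    pre-selected by a decision tree). Returns the unit sequence minimizing
    the total of target costs plus concatenation costs between neighbors.
    """
    n = len(targets)
    # best[i][j] = (cheapest cumulative cost ending at candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, n):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, c) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Backtrack the cheapest path from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy example: units are plain numbers, costs are absolute differences.
chosen = select_units(
    targets=[1, 5, 3],
    candidates=[[0, 1, 2], [4, 6], [3, 9]],
    target_cost=lambda t, c: abs(t - c),
    concat_cost=lambda a, b: abs(a - b),
)
```

Real systems replace the scalar costs with weighted sums over prosodic and spectral features, but the search structure is the same.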
Keywords/Search Tags:Bimodal Speech, Facial Animation, Text-to-Emotional Speech