Research On Crucial Techniques In Chinese Text To Speech System

Posted on: 2009-06-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P M Huang
GTID: 1118360278965426
Subject: Signal and Information Processing
Abstract/Summary:
Text-to-Speech (TTS) is a technology that converts arbitrary text into a speech signal. It can be applied in various fields, e.g. car navigation, announcements in railway stations, response services in telecommunications, and e-mail reading.

Although large-corpus-based systems can generate high-quality speech, they still have shortcomings. In particular, their huge storage demand prevents them from being used on devices with limited resources. At present, there are generally two types of solutions: one is to use new methods such as HMM-based speech synthesis, and the other is to greatly reduce the redundancy of the corpus while maintaining high speech quality (a small corpus TTS system). Both approaches reduce the storage demand significantly. Compared with the former, the latter produces better output speech at the cost of a slightly larger storage demand.

This dissertation further investigates several critical issues for the small corpus TTS system. The research and innovations are described in detail as follows:

1. The design of the synthesis unit inventory and the construction of the prosodic model are two key issues for a small corpus TTS system, but both depend on a large corpus with labeling information. Within the labeling task, precise speech segmentation and labeling are very important. To solve this problem, an automatic segmentation and labeling method that combines statistical approaches with rules is proposed. Two types of HMM models are used to produce the INITIAL/FINAL and syllable boundaries. Three feature detection algorithms are applied to refine the boundaries between voiced, unvoiced, and silence segments. Experimental results show that the proposed method significantly improves the performance of the segmentation system.

2. The clustering problem of syllable pitch contours is studied.
Through clustering and reasonable sample selection, the size of the large speech corpus can be significantly reduced. In addition, by introducing speech coding techniques, a small multi-sample tonal mono-syllable corpus can be built that satisfies the clarity and naturalness demands of small corpus or embedded TTS systems. For pitch contours of different lengths, a non-fixed-length contour clustering approach is proposed, which introduces the idea of dynamic programming (DP) into clustering. First, each pitch contour is normalized to zero mean. Then, the best path between two contours is found using the DP method. Finally, the distance between the two contours is calculated along this path. If the shapes of the two pitch contours are similar, the distance will be very low. In the sample selection stage, the tone domain of syllables is divided by pitch mean, and typical samples are then identified according to their levels and clusters. Clustering experiments show that this approach achieves better clustering results than traditional approaches. The new clustering approach is also validated by synthesis experiments.

3. A prosodic model is proposed that can be used to predict the pitch contour of a sentence. The method is as follows:
(1) The pitch contour templates are obtained by clustering;
(2) The decision tree method is used to construct a model that predicts pitch contour templates from the contextual information of syllables;
(3) According to different contexts, the control parameters of the syllable pitch contour templates, namely the pitch mean, the syllable duration, and the INITIAL duration, are computed separately, and acoustic parameter index trees are constructed for each kind of tonal syllable.
(4) The pitch contour of a sentence is obtained from the syllabic contexts, the pitch contour templates and their prediction model, the acoustic parameter index trees, and the silence durations.
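
The DP-based distance between pitch contours of different lengths, described in item 2, can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes a standard dynamic time warping recurrence, an absolute-difference local cost, and normalization by path length, none of which the abstract specifies. The function names `zero_mean` and `dp_contour_distance` are hypothetical.

```python
import math


def zero_mean(contour):
    """Subtract the mean so that only the contour shape remains."""
    if not contour:
        raise ValueError("pitch contour must not be empty")
    mean = sum(contour) / len(contour)
    return [value - mean for value in contour]


def dp_contour_distance(contour_a, contour_b):
    """Distance between two pitch contours that may differ in length.

    Assumed recurrence (not taken from the source): classic dynamic
    time warping with steps (i-1, j), (i, j-1), (i-1, j-1) and an
    absolute-difference local cost.
    """
    a = zero_mean(contour_a)
    b = zero_mean(contour_b)
    n, m = len(a), len(b)

    # cost[i][j]: minimal accumulated cost aligning a[:i+1] with b[:j+1]
    # steps[i][j]: number of cells on the path that achieves that cost
    cost = [[math.inf] * m for _ in range(n)]
    steps = [[0] * m for _ in range(n)]

    for i in range(n):
        for j in range(m):
            local = abs(a[i] - b[j])
            if i == 0 and j == 0:
                cost[i][j] = local
                steps[i][j] = 1
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], steps[i - 1][j]))
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            # Lowest accumulated cost wins; ties go to the shorter path.
            best_cost, best_steps = min(candidates)
            cost[i][j] = best_cost + local
            steps[i][j] = best_steps + 1

    # Assumed normalization: average local cost along the best path.
    return cost[n - 1][m - 1] / steps[n - 1][m - 1]


if __name__ == "__main__":
    rising = [100, 110, 120, 130]
    stretched = [100, 100, 110, 110, 120, 120, 130, 130]
    falling = [130, 120, 110, 100]
    print(dp_contour_distance(rising, stretched))
    print(dp_contour_distance(rising, falling))
```

Under these assumptions, a rising contour and a time-stretched copy of it yield a lower distance than a rising and a falling contour, which matches the statement that similarly shaped contours give a very low distance. A clustering algorithm that accepts pairwise distances could then group contours of unequal length using this measure.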
Keywords/Search Tags: Text-to-Speech System, Speech Automatic Segmentation and Labeling, Speech Corpus Reduction, Prosodic Modeling