
A Research On Speech Synthesis Based On Statistical Modeling And Pronunciation Error Detection

Posted on: 2012-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Lu
Full Text: PDF
GTID: 1118330335962442
Subject: Signal and Information Processing
Abstract/Summary:
With improvements in speech synthesizer quality and in statistical modeling methods, statistical model based parametric speech synthesis and statistical model based unit selection and concatenation have been proposed and have made significant progress over the last decade. Among these, Hidden Markov Model (HMM) based parametric speech synthesis and HMM based unit selection and concatenation have drawn increasing attention. Compared with traditional unit selection and concatenation synthesis, HMM based parametric synthesis has the following features: the system can be built faster, the synthesized voice is more intelligible and more fluent, the footprint of the system is much smaller, and the synthesized voice can be tuned more flexibly. HMM based unit selection and concatenation differs from the traditional method in that it uses HMM statistical models to guide the unit selection process, which reduces discontinuities in the concatenated speech.
However, the HMM based parametric system has its own disadvantages. Because the system uses the Maximum Likelihood (ML) or Minimum Generation Error (MGE) criterion to generate acoustic parameters from the HMMs, and these parameters are then passed to a speech synthesizer to produce the waveform, the voice quality of an HMM parametric system is not as good as that of a unit selection system, and the speech may sound buzzy, like a robot. Three factors cause the buzzy sound: (1) the vocoder (speech analysis/synthesis); (2) the inaccuracy of HMM based acoustic feature modeling; (3) the over-smoothing introduced by statistical modeling.
Moreover, although the naturalness and quality of synthesized speech are evaluated with the MOS (Mean Opinion Score), which is a subjective score, neither statistical parametric systems nor concatenative systems use human perception directly as a criterion in system construction.
To address the inaccuracy of statistical modeling in current methods, we propose to model phone duration with a full covariance HMM, and to optimize the MDL based context-dependent decision tree clustering process using the Minimum Cross Generation Error criterion. Because no current speech synthesis system uses human perception directly as a criterion in system building, this dissertation also proposes to incorporate human perception directly into the system building process by means of pronunciation error detection.
The dissertation is organized as follows:
Chapter 1 is the introduction. It presents the state-of-the-art HMM based parametric speech synthesis system, including its basic ideas, framework, advantages and disadvantages, as well as recent improvements to it.
Chapter 2 focuses on the full covariance HMM for phone duration in synthesized speech. The traditional HMM based speech synthesis system uses a diagonal covariance HMM for the phone duration model, which does not capture correlations between HMM states in either HMM modeling or duration parameter generation. Accordingly, we propose to model phone duration with a full covariance HMM in both context-dependent HMM decision tree clustering and parameter generation. Experiments show that, compared with the traditional method, full covariance HMM duration modeling improves the naturalness of duration.
In Chapter 3, my work on decision tree pruning is introduced.
To alleviate the inaccuracy of traditional MDL (Minimum Description Length) based HMM clustering, the Minimum Cross Generation Error (MCGE) criterion is proposed to conduct a two-step decision tree pruning. Experiments show that the quality of speech synthesized by the proposed method outperforms that of the traditional MDL based one.
Because no state-of-the-art speech synthesis method directly includes human perception in system construction, we incorporate subjective perception into the speech synthesis system building process for the first time. In Chapter 4, the traditional CALL (Computer-Assisted Language Learning) system is introduced first. Then the necessity of feeding human perception of synthesized speech back into the system building process is discussed. Thirdly, three methods are proposed, with both principles and experiments: error detection for the labeling of the speech synthesis training dataset, pronunciation error detection for synthesized speech, and speech synthesis based on pronunciation error detection. The basic ideas of the Support Vector Machine (SVM) and Kernel Fisher Discriminant (KFD) are also introduced. Subjective and objective experiments show that the error detection method for training database labels reduces label errors in the database, that the pronunciation error detection method can locate pronunciation errors in synthesized speech, and that the pronunciation error detection based synthesis method outperforms the traditional speech synthesis method.
The Blizzard Challenge, a worldwide speech synthesis evaluation, is presented in Chapter 5. The Blizzard Challenge 2009, including system building for the sub-tasks and the evaluation results, is described, and my work in the evaluation is also introduced.
Conclusions for the whole dissertation are given in Chapter 6.
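As an illustration of why the covariance structure matters for the duration model discussed for Chapter 2, the sketch below generates state durations for one phone under a fixed total length. It is not the dissertation's implementation, which the abstract does not describe. It assumes the state durations of a phone are jointly Gaussian and uses the textbook constrained maximum likelihood solution d = m + rho * (Sigma 1), with rho = (T - sum(m)) / (1' Sigma 1). The function name `generate_durations` and all numbers are hypothetical, and rounding to whole frames is omitted.

```python
def generate_durations(mean, cov, total):
    """Most likely state durations under N(mean, cov), subject to sum(d) == total.

    Maximizing the Gaussian log-likelihood with a Lagrange multiplier for the
    sum constraint gives d = mean + rho * (cov @ 1).
    """
    n = len(mean)
    # cov @ 1: row sums of the covariance matrix
    row_sums = [sum(cov[i][j] for j in range(n)) for i in range(n)]
    # 1' cov 1: sum of all covariance entries
    denom = sum(row_sums)
    # Scale factor that makes the durations add up to the requested total
    rho = (total - sum(mean)) / denom
    return [mean[i] + rho * row_sums[i] for i in range(n)]


# Hypothetical three-state phone: mean durations in frames
mean = [3.0, 5.0, 4.0]

# Diagonal covariance: each state is stretched in proportion to its own variance
diag_cov = [[1.0, 0.0, 0.0],
            [0.0, 4.0, 0.0],
            [0.0, 0.0, 1.0]]

# Full covariance: off-diagonal terms let correlated states stretch together
full_cov = [[1.0, 0.8, 0.0],
            [0.8, 4.0, 0.5],
            [0.0, 0.5, 1.0]]

diag_d = generate_durations(mean, diag_cov, 15.0)
full_d = generate_durations(mean, full_cov, 15.0)
print(diag_d)
print(full_d)
```

With the diagonal matrix the formula reduces to the familiar per-state rule d_k = m_k + rho * var_k. With the full matrix the same three extra frames are distributed differently, because the off-diagonal terms contribute to each state's share.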
Keywords/Search Tags: speech synthesis, Hidden Markov Model, parametric speech synthesis, unit selection, full covariance, decision tree pruning, error detection