Research On Speech Quality Evaluation For Tibetan Statistical Parametric Speech Synthesis

Posted on: 2016-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: S P Xu
GTID: 2308330470980039
Subject: Circuits and Systems
Abstract/Summary:
Statistical parametric speech synthesis can easily change voice timbre, needs little model storage space, and can synthesize speech for different speakers, speaking styles, and emotions from a limited training corpus. It has therefore become a mainstream speech synthesis method in recent years. This thesis focuses on evaluating the voice quality of Tibetan speech produced by a Tibetan statistical parametric speech synthesis system. An automatic speech unit segmentation method is proposed for labeling the time boundaries of speech units. The thesis also studies how different speech units and different time boundary labelings affect the Tibetan speech synthesized by the system. In addition, a speaker recognition method is adopted to evaluate the similarity between the synthesized Tibetan speech and the speech of the original Tibetan speaker. The main work and contributions are as follows:

Firstly, an automatic speech unit segmentation method is proposed for Tibetan statistical parametric speech synthesis. The Deterministic Annealing Expectation Maximization (DAEM) algorithm is introduced into HMM-based Tibetan speech synthesis to automatically label the time boundaries of speech synthesis units in an unlabeled training speech corpus. Initials and finals are used as the speech synthesis units. The DAEM algorithm is used to determine the optimal parameters of the embedded re-estimation during model training. The boundaries of the speech synthesis units are obtained by forced alignment during acoustic model training. Tests show that the unit boundaries obtained by the proposed method are close to the manually labeled boundaries.

Secondly, the effects of different speech units and different time boundary labelings on the synthesized Tibetan speech are evaluated. The HMM-based Tibetan statistical parametric speech synthesis system is trained with a manually labeled Tibetan speech corpus and with an automatically labeled Tibetan speech corpus. The quality of speech synthesized by acoustic models trained with initials and finals as the synthesis units is compared with that of speech synthesized by acoustic models trained with syllables as the synthesis units. Tests show that the synthesized Tibetan speech has poor quality when the training corpus is small. For both kinds of synthesis units, speech quality improves as the training corpus grows. The syllable-based and initial/final-based acoustic models synthesize Tibetan speech of similar quality when the training corpus is sufficiently large. For the syllable-based speech corpus, Tibetan speech synthesized with automatically labeled time boundaries is worse than Tibetan speech synthesized with manually labeled time boundaries.

Finally, a speaker recognition method is employed to evaluate the similarity between the synthesized Tibetan speech and the original Tibetan speech. Speaker recognition is realized by combining empirical mode decomposition (EMD) with short-time analysis of the speech signal. Speech similarity is measured by the speaker recognition results. Experimental results show that the synthesized Tibetan speech is similar to the original speech.
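The deterministic annealing idea behind DAEM can be illustrated on a toy problem. The sketch below is not the thesis's HMM training procedure: it applies DAEM to a one-dimensional Gaussian mixture, and the function names (`log_gauss`, `daem_fit`), the fixed shared variance, and the temperature schedule are all assumptions made for illustration. The only change from ordinary EM is that posteriors are raised to a power `beta` that increases from a small value to 1, so early iterations use smoothed responsibilities and depend less on the initial parameters.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def daem_fit(data, init_means, var=1.0,
             betas=(0.2, 0.4, 0.6, 0.8, 1.0), iters=20):
    """Toy DAEM for a 1-D Gaussian mixture with a fixed, shared variance.

    Assumption for illustration: only means and weights are re-estimated.
    beta = 1 recovers ordinary EM; smaller beta flattens the posteriors.
    """
    means = list(init_means)
    k = len(means)
    weights = [1.0 / k] * k
    for beta in betas:                      # annealing schedule
        for _ in range(iters):
            # E-step: tempered posteriors, computed in the log domain.
            resp = []
            for x in data:
                logs = [beta * (math.log(weights[j]) + log_gauss(x, means[j], var))
                        for j in range(k)]
                top = max(logs)
                scores = [math.exp(v - top) for v in logs]
                total = sum(scores)
                resp.append([s / total for s in scores])
            # M-step: the usual weighted updates.
            for j in range(k):
                nj = sum(r[j] for r in resp)
                if nj < 1e-12:              # skip a component with no support
                    continue
                means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
                weights[j] = nj / len(data)
    return means, weights


if __name__ == "__main__":
    samples = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
    print(daem_fit(samples, init_means=[1.0, 2.0]))
```

In an HMM setting the same tempering would be applied to the state occupancy posteriors during embedded training, which is beyond the scope of this sketch.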
Keywords/Search Tags: speech quality evaluation, Tibetan speech synthesis, statistical parametric speech synthesis, HMM