
Research Of Personalized Speech Generation

Posted on: 2012-11-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z W Shuang
GTID: 1118330335462383
Subject: Signal and Information Processing

Abstract/Summary:
Personalized speech generation produces speech with the characteristics of a target speaker. It has many applications. An important one is building customized text-to-speech (TTS) systems for different companies: a TTS system with a company's preferred voice can be created quickly and inexpensively by modifying the original speaker's speech corpus. Personalized speech generation can also be used to hide a speaker's identity during chat and online gaming, or to mimic another person's voice in multimedia messages for entertainment. Currently, two personalized speech generation methods are in common use: (1) voice conversion and (2) speech synthesis model adaptation. Each has its own advantages and disadvantages and suits different applications. In this thesis, we analyze the characteristics of these two methods and the connections between them, and make improvements that address the existing problems of each method and the practical requirements of real applications. Evaluation results demonstrate the effectiveness of our improvements.

In Chapter 1, we summarize speaker characteristics, the requirements of personalized speech generation, and the merits and appropriate usage scenarios of the different personalized speech generation methods. We first introduce pronunciation models, on the basis of which we summarize the features that characterize a speaker. We then analyze the practical requirements of different personalized speech generation applications, and discuss the characteristics and appropriate applications of the different methods.

In Chapter 2, we introduce and analyze in detail the two most popular groups of voice conversion methods: those based on Gaussian mixture models (GMM) and those based on codebook mapping. We first introduce the GMM-based methods and their most important variations, and then introduce the traditional codebook mapping method proposed by Abe and the STASC method proposed by Arslan.
Then, we compare and analyze the advantages and disadvantages of these two groups of methods. Finally, we discuss two problems common to both that we have found in practical applications: (1) the mismatch between the aligned training data of the source speaker and the target speaker, and (2) the oversmoothing of the converted spectrum. These comparisons and discussions guide our investigation of a new voice conversion method.

In Chapter 3, we propose a novel voice conversion method using frequency warping to address the problems of current methods. The frequency-warping function is generated by mapping the formants of the source speaker to those of the target speaker. With this method, only a very small amount of training data is required to generate the warping function, which greatly facilitates its application. To further improve the similarity to the target speaker, we propose a new method that combines frequency warping with unit selection of the target speaker's real spectrum. We use frequency warping to generate the warped source spectrum, which serves as an estimated target for the subsequent unit selection of the target speaker's spectrum. Part of the warped source spectrum is then replaced by the selected real spectrum of the target speaker before the converted speech is reconstructed. Formal voice conversion evaluation results show that the proposed frequency-warping method achieves much better quality of converted speech than other methods while also achieving a good balance between quality and similarity. Evaluation results also show that the combined method significantly improves the similarity score compared to using frequency warping alone.

In Chapter 4, to solve a practical problem that we encountered in mixed-language speech synthesis, we implement a mixed-language speech synthesis system based on a novel personalized speech generation method that combines speech synthesis model adaptation and voice conversion technology.
When synthesizing Chinese text mixed with English text, it is usually preferable to synthesize the mixed-language content with a single voice. However, the synthesized English of an HMM-based TTS system may sound unnatural if the models are built directly from a Chinese speaker's non-professional English data. We propose to use personalized speech generation to leverage a native English speaker's models and generate more natural English for the Chinese speaker. MLLR speaker adaptation is used to adapt the spectrum models of the native speaker, while the prosody adjustment techniques of voice conversion are applied to the prosody models to obtain better prosody. In the synthesis stage, the mixed-language content shares a unified prosody tree to improve the continuity between the Chinese and English content. Evaluation results show that, compared to using directly built models, the proposed method significantly improves the speaker consistency and naturalness of synthesized speech for mixed-language text. It is worth mentioning that this system was used on the official website of Shanghai EXPO 2010 to help visually impaired people listen to web content.

Chapter 5 summarizes this thesis and discusses future work.
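
As an illustration of the formant-based frequency warping described for Chapter 3, the following is a minimal sketch, not the thesis's implementation. It assumes the warping function is piecewise linear through matched formant pairs, pinned at 0 Hz and at the Nyquist frequency, and that the spectrum is a uniformly sampled envelope. The function names, the formant values, and the piecewise-linear form are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def make_warping_anchors(src_formants, tgt_formants, nyquist):
    """Build anchor points of a piecewise-linear warping function.

    Assumption: the warp passes through (0, 0), each matched
    (source formant, target formant) pair, and (nyquist, nyquist).
    """
    src = np.concatenate(([0.0], np.asarray(src_formants, dtype=float), [nyquist]))
    tgt = np.concatenate(([0.0], np.asarray(tgt_formants, dtype=float), [nyquist]))
    if src.shape != tgt.shape:
        raise ValueError("source and target need the same number of formants")
    if np.any(np.diff(src) <= 0) or np.any(np.diff(tgt) <= 0):
        raise ValueError("formants must be strictly increasing and below Nyquist")
    return src, tgt


def warp_envelope(envelope, src_anchors, tgt_anchors, nyquist):
    """Warp a uniformly sampled spectral envelope along the frequency axis.

    For every output frequency we look up the source frequency that the
    warp maps onto it (the inverse warp), then read the source envelope
    there by linear interpolation.
    """
    envelope = np.asarray(envelope, dtype=float)
    freqs = np.linspace(0.0, nyquist, len(envelope))
    src_freqs = np.interp(freqs, tgt_anchors, src_anchors)  # inverse warp
    return np.interp(src_freqs, freqs, envelope)


# Hypothetical example: a source envelope with one peak at 500 Hz, and
# illustrative formant pairs (500 -> 600 Hz, 1500 -> 1800 Hz).
nyquist = 8000.0
freqs = np.linspace(0.0, nyquist, 81)  # 100 Hz bin spacing
source_envelope = np.exp(-((freqs - 500.0) / 100.0) ** 2)

src_a, tgt_a = make_warping_anchors([500.0, 1500.0], [600.0, 1800.0], nyquist)
warped = warp_envelope(source_envelope, src_a, tgt_a, nyquist)
print("peak of warped envelope (Hz):", freqs[np.argmax(warped)])
```

Because such a warp only needs a handful of matched formant pairs, a sketch like this is consistent with the abstract's remark that very little training data is required. How the formants are estimated and matched is outside the scope of this example.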
Keywords/Search Tags:Personalized Speech Generation, Voice Conversion, Speech Synthesis, Frequency Warping, Adaptation, Mixed Language