
Research On Methods For Voice Conversion

Posted on: 2005-04-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Lv
GTID: 1118360155474038
Subject: Circuits and Systems
Abstract/Summary:
Voice conversion is a technology that modifies a source speaker's speech utterance so that it sounds as if a target speaker had spoken it. It is a relatively new area of speech technology, and research on it can also advance speech analysis, speech coding, speech synthesis, speech enhancement, speech recognition, and related fields. This dissertation addresses two types of voice conversion: single-language voice conversion and cross-language voice conversion. The main contributions are as follows. 1. An improved method for transforming the spectral envelope is proposed. It is an important component of the voice conversion system and serves as the baseline against which the other systems are evaluated. The transformation function is implemented as a regressive, joint-density Gaussian mixture model (GMM), trained on aligned line spectral frequency (LSF) vectors with the expectation-maximization (EM) algorithm. The analysis/synthesis model is linear prediction, a well-studied model on which most vocoders, such as CELP and MELP, are based; this makes the system well suited to applications where storage space is limited. No constraints are placed on the speakers' speech, and the training and test speech is fully natural. During training, poorly aligned data are rejected. An informal listening test shows that the transformed speech sounds like the target speaker, with high intelligibility and naturalness, and an objective evaluation shows that the system outperforms other similar systems. 2. The baseline is improved by adding a residual prediction module, which predicts the target LPC residuals from the transformed LPC spectral envelopes using a classifier and residual codebooks trained on the target speaker's residual signals. Combining the spectral envelope transformation system with the residual prediction module yields a high-quality voice conversion system.
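To make the regressive joint-density GMM in item 1 concrete, the sketch below evaluates the standard conversion function F(x) = sum_i p(i|x) [mu_y_i + Sigma_yx_i Sigma_xx_i^{-1} (x - mu_x_i)] for one source LSF vector x. It is a minimal illustration under stated assumptions, not the dissertation's implementation: EM training is omitted (the parameters are hand-set), every covariance block is assumed diagonal so the regression reduces to one slope per dimension, and the dictionary keys (`weight`, `mu_x`, `mu_y`, `var_xx`, `cov_yx`) are hypothetical names.

```python
import math

# Each mixture component is a dict with hypothetical keys:
#   weight: mixture weight; mu_x, mu_y: source/target mean vectors;
#   var_xx: diagonal of Sigma_xx; cov_yx: diagonal of Sigma_yx.
# Assumption: all covariance blocks are diagonal (a simplification).

def gauss_diag_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
        for xd, m, v in zip(x, mean, var)
    )

def posteriors(x, components):
    """p(i | x) under the source-side marginal of the joint GMM."""
    logs = [
        math.log(c["weight"]) + gauss_diag_logpdf(x, c["mu_x"], c["var_xx"])
        for c in components
    ]
    top = max(logs)  # subtract the max before exp for numerical stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def convert(x, components):
    """Posterior-weighted sum of per-component linear regressions."""
    post = posteriors(x, components)
    y = [0.0] * len(x)
    for p, c in zip(post, components):
        for d, xd in enumerate(x):
            slope = c["cov_yx"][d] / c["var_xx"][d]  # Sigma_yx / Sigma_xx
            y[d] += p * (c["mu_y"][d] + slope * (xd - c["mu_x"][d]))
    return y

if __name__ == "__main__":
    comps = [{"weight": 1.0, "mu_x": [0.2, 0.4], "mu_y": [0.3, 0.5],
              "var_xx": [0.01, 0.01], "cov_yx": [0.005, 0.005]}]
    print(convert([0.3, 0.4], comps))  # approximately [0.35, 0.5]
```

In a complete system the parameters would come from EM on stacked vectors [x; y] of aligned source and target frames, typically with full or block-diagonal covariances; the soft posterior weighting is what keeps the mapping smooth across component boundaries.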
Informal listening tests show that the transformed speech also sounds like the target speaker, while the prosody of the source speaker is retained. However, buzzing and other artifacts remain, which are typically associated with RELP manipulation. 3. A phoneme-based approach is presented. Rather than using a GMM to partition the speaker's acoustic space, this method partitions it explicitly into phonemes using the alignments, and a GMM is then used for finer modeling within each phoneme. This avoids the effects of dynamic time warping (DTW) on the training of the transformation function. Objective tests show that the conversion quality is comparable to that of other approaches that require a parallel speech corpus. 4. Preliminary research on cross-language voice conversion is also presented. A comparison of Chinese and English phonemes shows that the two languages share some similar phonemes. On this basis, an automatic phonetic class segmentation and mapping approach is proposed. Once the corresponding classes of the source and target speakers have been located, conventional parameter training methods can be applied for voice conversion. Experimental results of the proposed class segmentation and mapping between Chinese and English are presented. As speech technology develops, more and more human-machine interaction products are used in daily life, and improving their naturalness has become an important concern. Voice conversion can serve as a useful tool in this area because it offers new insights into the personalization of speech-enabled systems, and it is of far-reaching significance in both theory and application.
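The phoneme-based idea in item 3 (partition the data explicitly by phoneme label, then fit a separate mapping within each class) can be sketched as follows. This is a minimal illustration under loud assumptions: source and target frames are taken as already paired and phoneme-labeled (how pairing is done is not shown), and each per-phoneme GMM is replaced by a single-component, per-dimension least-squares line for brevity. The function names `train_phoneme_maps` and `convert_frame` are hypothetical.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Least-squares fit y = a + b * x for one feature dimension."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0  # fall back to a constant offset
    return my - b * mx, b

def train_phoneme_maps(pairs):
    """pairs: list of (phoneme, source_vec, target_vec) frames (assumed paired).
    Returns {phoneme: [(a_d, b_d) for each dimension d]}."""
    groups = defaultdict(list)
    for ph, x, y in pairs:
        groups[ph].append((x, y))  # explicit partition by phoneme label
    maps = {}
    for ph, frames in groups.items():
        dim = len(frames[0][0])
        maps[ph] = [
            fit_line([x[d] for x, _ in frames], [y[d] for _, y in frames])
            for d in range(dim)
        ]
    return maps

def convert_frame(phoneme, x, maps):
    """Apply the mapping trained for this frame's phoneme class."""
    return [a + b * xd for (a, b), xd in zip(maps[phoneme], x)]

if __name__ == "__main__":
    data = [("a", [0.1, 0.2], [0.2, 0.4]), ("a", [0.3, 0.4], [0.4, 0.6]),
            ("i", [0.5, 0.6], [0.5, 0.7]), ("i", [0.7, 0.8], [0.6, 0.8])]
    maps = train_phoneme_maps(data)
    print(convert_frame("a", [0.2, 0.3], maps))  # approximately [0.3, 0.5]
```

Because the class is chosen by the phoneme label rather than by GMM posteriors, each mapping only ever sees frames of one phoneme; in the full approach, a GMM fitted inside each class would replace the single line per dimension.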
Keywords/Search Tags: Voice Conversion, Gaussian Mixture Model, Phoneme, Cross Language Voice Conversion