Research On Modeling Methods For Voice Conversion

Posted on: 2014-01-31  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L H Chen
GTID: 1228330398456596  Subject: Signal and Information Processing
Abstract/Summary:
In addition to linguistic content, a speech signal carries acoustic information about the speaker's characteristics, and speaker individuality plays an important role in speech communication. This dissertation focuses on voice conversion, a relatively new research area in speech signal processing. Voice conversion is a technique that modifies the speech of one speaker (the source speaker) so that it sounds as if it were uttered by another specific speaker (the target speaker), without changing the linguistic information. It is of both theoretical and practical significance for speech signal processing. Over the last decade, a voice conversion method based on a statistical parametric model, the Gaussian mixture model (GMM), has been proposed and has become the mainstream approach. Compared with conventional approaches, its advantages include fast and automatic system construction, robustness, good similarity of the converted speech, smoothness, and stability. The performance of a voice conversion method is evaluated in two respects: similarity to the target speaker and naturalness of the converted speech. Speech converted by the state-of-the-art GMM-based approach achieves quite good similarity, but its voice quality is degraded, leaving a large gap in naturalness relative to natural speech. In addition, this approach is less flexible because of its special requirements on training data.

This dissertation focuses on the application of statistical modeling to spectral conversion. We propose several methods that improve conversion performance from two aspects: model and feature. In the model aspect, we first propose a GMM-based conversion method with an explicit feature transform, and then propose using restricted Boltzmann machines (RBMs) instead of Gaussians for joint spectral modeling and conversion.
In the feature aspect, we propose two methods for decomposing the speech signal into speaker-specific and content-specific information: a speaker-independent model of the linguistic content space, and deep neural network based feature extraction. These methods allow speaker characteristics to be converted directly and improve the flexibility of conversion.

The dissertation is organized as follows.

Chapter 1 is the introduction. It briefly introduces the research area and significance of voice conversion, and reviews the history and state of the art of the field.

Chapter 2 first introduces the factors that affect speaker characteristics in the speech signal, and on that basis introduces the GMM-based voice conversion method, including the fundamental principles of the GMM, the system framework, some key techniques in the system, and some typical spectral conversion methods. The motivation for our research follows from an analysis of the characteristics of these methods.

Chapter 3 introduces an improved model for joint spectral space modeling. Because the conventional GMM does not directly model the mapping relations between the source speaker and the target speaker, we propose explicit linear transformations to model these mapping relations and to constrain the probability distributions of the joint space. The model is also extended to training on non-parallel data to improve its flexibility.

Chapter 4 introduces two methods for separately modeling the speaker characteristics and the linguistic content in the speech signal. In the first method, we introduce a speaker-independent (SI) model to describe the linguistic content space that the speaker-dependent (SD) acoustic spaces have in common, and use the mapping relations from the SI space to the SD space to describe the speaker characteristics. In the second method, we adopt a deep neural network (DNN) to extract speaker-specific and content-specific features.
The conversion flexibility is improved by converting the speaker-specific information extracted by these methods.

Chapter 5 introduces a method that uses RBMs to model the probability distributions of the joint space and uses these models to directly model and convert the original spectral envelopes. After briefly reviewing the limited modeling ability of Gaussian-based models, we propose adopting RBMs instead of Gaussians to model the distribution of each mixture's space under the GMM framework, and derive conversion functions from the RBMs. We show a significant improvement in both the similarity and the voice quality of the converted speech.

Chapter 6 concludes the dissertation.
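
For orientation, the conventional joint-density GMM mapping that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration and not the dissertation's own implementation. It assumes scalar source and target features (real systems use spectral feature vectors), and the function names, dictionary keys, and toy mixture parameters are all hypothetical. The converted feature is taken to be the conditional mean of the target feature given the source feature, which is the standard minimum mean-square-error mapping for a joint GMM.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x, mixtures):
    """Map a source feature x to the expected target feature E[y | x].

    Each mixture is a dict with hypothetical keys:
      w     mixture weight
      mu_x  mean of the source feature
      mu_y  mean of the target feature
      s_xx  variance of the source feature
      s_xy  covariance between source and target features
    """
    # Posterior probability of each mixture given only the source feature.
    likelihoods = [m["w"] * gaussian_pdf(x, m["mu_x"], m["s_xx"]) for m in mixtures]
    total = sum(likelihoods)
    posteriors = [value / total for value in likelihoods]

    # Each mixture contributes a linear regression from x to y,
    # weighted by its posterior probability.
    return sum(
        p * (m["mu_y"] + m["s_xy"] / m["s_xx"] * (x - m["mu_x"]))
        for p, m in zip(posteriors, mixtures)
    )


# Toy parameters, invented purely for illustration.
TOY_MIXTURES = [
    {"w": 0.5, "mu_x": -1.0, "mu_y": -2.0, "s_xx": 1.0, "s_xy": 0.5},
    {"w": 0.5, "mu_x": 1.0, "mu_y": 2.0, "s_xx": 1.0, "s_xy": 0.5},
]

if __name__ == "__main__":
    for source_value in (-1.0, 0.0, 1.0):
        print(source_value, convert(source_value, TOY_MIXTURES))
```

In this sketch the mapping between speakers exists only implicitly, through the cross-covariance of each mixture. The abstract names this lack of direct modeling as the motivation for the explicit linear transformations of Chapter 3.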
Keywords/Search Tags: voice conversion, Gaussian mixture model, joint space modeling, deep neural network, restricted Boltzmann machine