Study On The Neural Network Modelling Method For Voice Conversion

Posted on: 2016-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: L J Liu
GTID: 2308330470457756
Subject: Information and Communication Engineering
Abstract/Summary:
Voice conversion is a technique that modifies the speech characteristics of a source speaker so that the speech sounds like that of a target speaker, while keeping the linguistic information unchanged. It is a relatively new branch of speech signal processing. Research on this technique not only advances related areas such as speech coding, speech synthesis, speech enhancement, and speech recognition, but also has many applications, such as multimedia entertainment, medicine, and secure communication. The GMM-based method is the mainstream approach to voice conversion. It offers high similarity between the converted speech and the target speaker, as well as good robustness. However, the speech quality is degraded by the over-smoothing problem. Moreover, because the trained conversion model depends on both the source and the target speaker, a new model has to be trained for every new speaker pair, which makes the method inflexible.
This dissertation concentrates on alleviating the over-smoothing problem in GMM-based methods and on proposing new conversion methods that allow flexible conversion. The over-smoothing problem in GMM-based methods has two main causes: 1) the use of high-level spectral features extracted from the raw speech spectra, since some detailed characteristics of the raw spectra are lost during extraction; 2) the inadequate ability of the GMM to model the non-linear relationship between the spectral feature vectors of the source and target speakers, since it can only construct a linear mapping. Considering that it is difficult for a GMM to model spectral envelopes, we propose a new method using Gaussian bidirectional associative memories (GBAMs), which can model the joint distribution of the spectral envelopes of the source and target speakers. With this method, the naturalness and similarity of the converted speech are improved.
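The point that a GMM can only construct a linear mapping can be illustrated with the commonly used joint-density GMM conversion function, in which each mixture component contributes one linear regression and the components are blended by their posterior probabilities. The following is a minimal NumPy sketch under that assumption. The function name `gmm_convert`, the parameter names, and the toy values are hypothetical and are not taken from the thesis.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Convert one source frame x with a joint-density GMM (illustrative sketch).

    weights: (M,) mixture weights
    mu_x, mu_y: (M, D) source and target means per component
    cov_xx, cov_yx: (M, D, D) source covariance and target-source cross-covariance
    """
    n_mix = len(weights)
    # Log posterior of each component given x (terms shared by all components are dropped).
    log_post = np.empty(n_mix)
    for m in range(n_mix):
        diff = x - mu_x[m]
        _, logdet = np.linalg.slogdet(cov_xx[m])
        maha = diff @ np.linalg.solve(cov_xx[m], diff)
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + maha)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Posterior-weighted sum of per-component linear regressions.
    y = np.zeros(mu_y.shape[1])
    for m in range(n_mix):
        y += post[m] * (mu_y[m] + cov_yx[m] @ np.linalg.solve(cov_xx[m], x - mu_x[m]))
    return y

# Toy example with a single component (hypothetical values): the mapping is purely affine.
weights = np.array([1.0])
mu_x = np.zeros((1, 2))
mu_y = np.ones((1, 2))
cov_xx = np.eye(2)[None, :, :]
cov_yx = 2.0 * np.eye(2)[None, :, :]
y_hat = gmm_convert(np.array([1.0, 2.0]), weights, mu_x, mu_y, cov_xx, cov_yx)
print(y_hat)  # [3. 5.]
```

With several components, the output is a posterior-weighted average of such linear predictions, which is one intuitive way to see why fine spectral detail tends to be averaged out.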
As the mapping built by the GBAM is still linear, we further propose a generatively trained deep neural network (GTDNN) based on restricted Boltzmann machines (RBMs) and a Bernoulli BAM (BAM). The GTDNN can construct a non-linear relationship between the spectral envelopes of the source and target speakers, so the conversion performance is further improved. In addition, a DNN trained with multiple source speakers is proposed to enable flexible conversion. The resulting DNN can be regarded as a source-speaker-independent model that converts speech from an arbitrary source speaker to a given target speaker directly, which makes conversion for new speakers more convenient. The experimental results show that it achieves performance comparable to the conventional GMM-based method. Furthermore, it can serve as an initialization model for training a source- and target-speaker-dependent DNN, which outperforms the conventional initialization method based on a deep belief network (DBN).
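To make the contrast with a linear mapping concrete, the sketch below shows a generic feed-forward pass that maps source spectral-envelope frames to target frames through sigmoid hidden layers. It is only an illustration under stated assumptions: the layer sizes, the random weights, and the names `sigmoid` and `dnn_convert` are hypothetical. In the scheme described above, the weights would instead come from generative training (RBMs and a BAM), which is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_convert(x, layers):
    """Map source frames x (N, D_in) to target frames through a feed-forward network.

    layers: list of (W, b) pairs; hidden layers use sigmoid units, the output layer is linear.
    """
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)
    W_out, b_out = layers[-1]
    return h @ W_out + b_out

# Hypothetical sizes and random weights, standing in for trained parameters.
rng = np.random.default_rng(0)
dim, hidden = 8, 16
layers = [
    (rng.normal(scale=0.1, size=(dim, hidden)), np.zeros(hidden)),     # source-side hidden layer
    (rng.normal(scale=0.1, size=(hidden, hidden)), np.zeros(hidden)),  # hidden-to-hidden link
    (rng.normal(scale=0.1, size=(hidden, dim)), np.zeros(dim)),        # linear output layer
]
frames = rng.normal(size=(5, dim))       # five toy source frames
converted = dnn_convert(frames, layers)  # five converted frames of the same dimension
```

For a source-speaker-independent model of the kind described above, the same network would be trained on frames pooled from several source speakers paired with one target speaker; the forward pass itself is unchanged.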
Keywords/Search Tags: voice conversion, spectral envelope conversion, bidirectional associative memory, restricted Boltzmann machine, deep neural network, source-speaker-independent model