Neural Network Based Voice Conversion

Posted on: 2015-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: F L Xie
GTID: 2298330422990920
Subject: Computer Science and Technology
Abstract/Summary:
Neural network (NN) based voice conversion, which employs a nonlinear function to map features from a source speaker to a target speaker, has been shown to outperform GMM-based voice conversion. However, NN-based voice conversion still has limitations. For example, the NN is trained on a Frame Error (FE) minimization criterion: its weights are adjusted to minimize the squared error over the whole source-target stereo training data set. In this paper, we borrow the idea of sentence-level optimization from minimum generation error (MGE) training in HMM-based TTS synthesis and replace FE minimization with Sequence Error (SE) minimization in NN training for voice conversion. The conversion error over a training sentence from a source speaker to a target speaker is minimized via a gradient descent-based back propagation (BP) procedure. Experimental results show that speech converted by an NN that is first trained with frame error minimization and then refined with sequence error minimization sounds subjectively better than speech converted by an NN trained with frame error minimization only. Scores on both naturalness and similarity to the target speaker improve.

In the voice conversion task, prosody conversion, and pitch conversion in particular, is also a very challenging research topic because pitch is discontinuous. Conventionally, pitch conversion is achieved by adjusting the mean and variance of the source pitch distribution to match those of the target pitch distribution. This method removes most of the detailed information about the speaker's prosody and preserves only the F0 contour. In this paper, we propose a neural network based pitch conversion system that converts F0 and spectral features together, frame by frame.
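For reference, the conventional mean-and-variance pitch conversion described above can be sketched as follows. This is a minimal illustration, not the thesis implementation. It assumes that the statistics are computed on log-F0, that unvoiced frames are marked with F0 = 0 and passed through unchanged, and that the function names `log_f0_stats` and `gaussian_normalized_f0` are hypothetical. Only the Python standard library is used.

```python
import math


def log_f0_stats(f0):
    """Return (mean, std) of log-F0 over voiced frames (assumed: F0 > 0)."""
    logs = [math.log(v) for v in f0 if v > 0]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)


def gaussian_normalized_f0(f0_src, src_stats, tgt_stats):
    """Shift and scale source log-F0 to match target mean and variance.

    Assumption: unvoiced frames are encoded as 0 and are left as 0.
    """
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    converted = []
    for v in f0_src:
        if v > 0:
            # Normalize with source statistics, then rescale with target ones.
            z = (math.log(v) - src_mean) / src_std
            converted.append(math.exp(z * tgt_std + tgt_mean))
        else:
            converted.append(0.0)
    return converted
```

Because every voiced frame goes through the same linear map in the log domain, the shape of the source contour is kept and only its level and range change, which is the limitation the abstract points out.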
Experimental results show that neural network based pitch conversion significantly reduces the Unvoiced/Voiced (U/V) error and the RMSE of F0 between converted pitch and target pitch compared with the conventional Gaussian normalized transformation method. Wavelet decomposition of F0 can further improve the conversion performance.
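The two objective measures named above could be computed along the following lines. This is a hedged sketch under stated assumptions, since the abstract does not define the metrics precisely: the converted and target F0 sequences are assumed to be time-aligned frame by frame, unvoiced frames are assumed to be marked with F0 = 0, and the RMSE is assumed to be taken in Hz over frames that are voiced in both sequences. The function names `uv_error_rate` and `f0_rmse` are hypothetical.

```python
import math


def uv_error_rate(f0_converted, f0_target):
    """Fraction of frames whose voiced/unvoiced decision differs."""
    if len(f0_converted) != len(f0_target):
        raise ValueError("sequences must be time-aligned and equal in length")
    mismatches = sum(
        (c > 0) != (t > 0) for c, t in zip(f0_converted, f0_target)
    )
    return mismatches / len(f0_target)


def f0_rmse(f0_converted, f0_target):
    """RMSE in Hz over frames voiced in both sequences (an assumption)."""
    diffs = [
        c - t for c, t in zip(f0_converted, f0_target) if c > 0 and t > 0
    ]
    if not diffs:
        return float("nan")  # no jointly voiced frames to compare
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Restricting the RMSE to jointly voiced frames keeps voicing mistakes out of the pitch-accuracy figure, so the two numbers measure different kinds of error.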
Keywords/Search Tags: voice conversion, neural network, pre-training, sequence error minimization, pitch conversion, wavelet decomposition