Voice Conversion Using Deep Belief Network In Super-frame Feature Space

Posted on: 2017-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: W Ye
GTID: 2308330488962079
Subject: Information and Communication Engineering
Abstract/Summary:
Voice conversion (VC) is a technique that modifies the characteristics of a source speaker's voice so that it sounds like a target speaker's voice, while keeping the spoken content unchanged. Spectral conversion based on the Gaussian mixture model (GMM) is widely used because it performs well. However, the traditional GMM transformation processes each feature vector independently of its previous and next frames, so the converted speech exhibits a certain degree of discontinuity. In addition, GMM-based VC systems suffer from over-smoothing, which results from the limited capability of traditional GMM-based conversion functions to capture the source-target correspondence for a given parametric representation of speech.

This thesis presents a voice conversion technique that uses a deep belief network (DBN) to map the spectral envelopes of a source speaker to those of a target speaker. Short-time spectral envelopes are represented by linear prediction cepstral coefficients (LPCC), which are derived from STRAIGHT analysis; dynamic time warping (DTW) is then used to align the parameters of the source speaker with those of the target speaker. Neighboring frames of the source speaker are gathered into super-frames that serve as the input data of the network, and the corresponding frame of the target speaker is used as the output data. The spectral conversion function is obtained by training the neural network. ABX and MOS evaluations indicate that the proposed method achieves better conversion performance than the traditional GMM method under the parallel-corpus condition.
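The data preparation described above (DTW alignment followed by super-frame construction) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the function names `dtw_path`, `make_super_frames`, and `build_training_pairs`, the context width `context`, the Euclidean frame distance, the edge padding, and the choice to stack neighbors in the source utterance's original time order are all assumptions, since the abstract does not specify them. Feature extraction with STRAIGHT and the DBN training itself are omitted.

```python
import numpy as np


def dtw_path(src, tgt):
    """Align two feature sequences (frames x dims) with basic DTW.

    Returns a list of (source_index, target_index) pairs.
    Euclidean frame distance is an assumption, not taken from the thesis.
    """
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distance between every source and target frame.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the end of both sequences to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def make_super_frames(frames, context=2):
    """Concatenate each frame with `context` neighbors on each side.

    Edges are padded by repeating the first/last frame (an assumption).
    Output shape: (num_frames, (2 * context + 1) * dims).
    """
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    num_frames = len(frames)
    return np.hstack([padded[k:k + num_frames] for k in range(2 * context + 1)])


def build_training_pairs(src, tgt, context=2):
    """Return (network inputs, network targets) for one parallel utterance."""
    path = dtw_path(src, tgt)
    src_idx = np.array([i for i, _ in path])
    tgt_idx = np.array([j for _, j in path])
    inputs = make_super_frames(src, context)[src_idx]  # source super-frames
    targets = tgt[tgt_idx]                             # aligned target frames
    return inputs, targets
```

Under these assumptions, each network input has `(2 * context + 1)` times the dimensionality of a single frame, while each output is one target frame; this is how the super-frame input gives the network access to the inter-frame context that a frame-by-frame GMM mapping ignores.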
Keywords/Search Tags: Voice conversion, DBN, Short-time spectrum deep feature, Super-frames