Research On Technology Of Voice Conversion

Posted on: 2017-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: B Lu
GTID: 2308330485486146
Subject: Signal and Information Processing
Abstract/Summary:
Voice conversion (VC) is a technique that automatically modifies the timbre and/or prosody of one speaker's (the source's) voice so that it sounds as if it were spoken by another speaker (the target), while keeping the linguistic content unchanged. After reviewing the fundamentals of VC, this thesis proposes a new voice conversion technique based on sparse-representation nonnegative matrix factorization (SNMF). The technique is compared with a state-of-the-art baseline VC method based on the maximum-likelihood Gaussian mixture model (ML-GMM) on the CMU ARCTIC parallel corpus, and the results show that the proposed method matches ML-GMM in subjective listening tests. Moreover, when training data is limited, the speaker identification rate of the proposed SNMF VC technique is above 72%, while that of ML-GMM is below 28%. In addition, in the subjective mean opinion score (MOS) test, SNMF scores 2.6, higher than the 1.8 obtained by ML-GMM. The comparison shows that SNMF offers better subjective listening quality and robustness. This thesis then proposes two improvements to SNMF VC that raise its performance and further reduce spectral distortion. First, considering the complexity and variability of speech signals, the k-means clustering algorithm is introduced into the SNMF VC system to strengthen the ability of NMF to uncover the latent features of the speech signal. This method, called kmeansSNMF, first applies k-means clustering to all the training data to partition it into k clusters and then performs SNMF VC within each cluster separately.
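As a rough illustration of the sparse NMF conversion idea described above, the following minimal sketch estimates sparse nonnegative activations of source frames over a source dictionary and reuses them with a paired target dictionary. Everything here is an assumption made for illustration, not the formulation used in the thesis: the Euclidean cost with an L1 penalty, the multiplicative update rule, the names `sparse_activations`, `convert`, `A_src`, and `B_tgt`, and the toy data.

```python
import numpy as np

def sparse_activations(X, A, sparsity=0.1, n_iter=200, eps=1e-9, seed=0):
    """Estimate nonnegative H with X ~= A @ H, keeping the dictionary A fixed.

    Assumed objective: 0.5 * ||X - A H||_F^2 + sparsity * sum(H), H >= 0.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[1], X.shape[1])) + eps  # strictly positive start
    AtX = A.T @ X
    AtA = A.T @ A
    for _ in range(n_iter):
        # Multiplicative update; the L1 penalty appears in the denominator.
        H *= AtX / (AtA @ H + sparsity + eps)
    return H

def convert(X_src, A_src, B_tgt, **kwargs):
    """Apply the source activations to the paired target dictionary."""
    H = sparse_activations(X_src, A_src, **kwargs)
    return B_tgt @ H, H

# Toy data (hypothetical): 6 spectral bins, 4 paired exemplars, 5 frames.
rng = np.random.default_rng(1)
n_bins, n_atoms, n_frames = 6, 4, 5
A_src = np.eye(n_bins)[:, :n_atoms] + 0.1 * rng.random((n_bins, n_atoms))
B_tgt = rng.random((n_bins, n_atoms))
# Each toy source frame is exactly one source exemplar.
H_true = np.eye(n_atoms)[:, rng.integers(0, n_atoms, n_frames)]
X_src = A_src @ H_true

Y_conv, H_est = convert(X_src, A_src, B_tgt, sparsity=0.01)
rel_err = np.linalg.norm(X_src - A_src @ H_est) / np.linalg.norm(X_src)
print(Y_conv.shape, rel_err)
```

In a kmeansSNMF-style variant, one would presumably keep one dictionary pair per cluster and run the same activation estimate inside the cluster that each frame is assigned to; that step is not shown here.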
The experimental results indicate that this improved method greatly reduces the spectral distortion of SNMF and lets the SNMF VC technique make more effective use of large amounts of training data. Second, given the importance of inter-frame information, this thesis introduces combined frames, which join three or more frames into one large frame and thereby bring inter-frame information into kmeansSNMF. The new method lowers the spectral distortion, improves perceived naturalness, and achieves better subjective listening performance than the classical ML-GMM method: its MOS is 3.78, compared with 3.70 for ML-GMM. Finally, inspired by the SNMF voice conversion technique, this thesis applies a method called joint nonnegative matrix factorization, which factorizes two or more training data matrices simultaneously with a single fixed activation matrix. Based on this method, the thesis proposes a cross-voice conversion system that performs one-to-many and many-to-one voice conversion rather than the conventional one-to-one (source-to-target) conversion.
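The joint factorization mentioned above can be sketched as follows, reading "one activation matrix" as a single matrix H shared by all speakers, so that each speaker's data satisfies X_k ~= W_k @ H. This is only a sketch under assumptions: the Euclidean-cost multiplicative updates, the function name `joint_nmf`, and the toy data are hypothetical and are not taken from the thesis.

```python
import numpy as np

def joint_nmf(X_list, rank, n_iter=300, eps=1e-9, seed=0):
    """Factorize every X_k ~= W_k @ H with one activation matrix H shared by all.

    Stacking the matrices along the feature axis reduces the joint problem
    to a single NMF of the stacked matrix (assumes time-aligned frames).
    """
    rng = np.random.default_rng(seed)
    sizes = [X.shape[0] for X in X_list]
    X = np.vstack(X_list)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(n_iter):
        # Standard multiplicative updates for the Frobenius-norm cost.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    # Split the stacked basis back into one dictionary per speaker.
    W_list = np.split(W, np.cumsum(sizes)[:-1], axis=0)
    return W_list, H

# Toy data (hypothetical): two speakers, 8 bins each, 40 aligned frames.
rng = np.random.default_rng(2)
H_true = rng.random((3, 40))
X1 = rng.random((8, 3)) @ H_true
X2 = rng.random((8, 3)) @ H_true

W_list, H = joint_nmf([X1, X2], rank=3)
# Cross conversion in this toy setting: the shared H with speaker 2's basis.
X2_hat = W_list[1] @ H
err1 = np.linalg.norm(X1 - W_list[0] @ H) / np.linalg.norm(X1)
err2 = np.linalg.norm(X2 - X2_hat) / np.linalg.norm(X2)
print(err1, err2)
```

Because every speaker shares the same H, activations estimated from any one speaker can be paired with any other speaker's basis, which is one way a one-to-many or many-to-one system could be organized.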
Keywords/Search Tags:voice conversion, GMM, NMF, kmeans, cross-voice conversion