
Non-parallel Voice Conversion Using ACGAN And Variational Autoencoders Conditioned By Sentence Embedding

Posted on: 2020-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Shi
Full Text: PDF
GTID: 2428330590995535
Subject: Signal and Information Processing
Abstract/Summary:
Voice conversion transforms the speaker identity of a source utterance into that of a target speaker while preserving the linguistic content of the source speech. This thesis addresses the over-regularization of latent variables in the VAWGAN voice conversion model by introducing sentence embedding and a Text-encoder, and it improves the GAN (generative adversarial network) structure by adopting the auxiliary classifier GAN (ACGAN). Both changes improve the speech quality and speaker similarity of the converted speech.

First, the thesis applies sentence embedding trained by the Text-encoder to the voice conversion model based on VAE and WGAN. The semantic information contained in the sentence embedding improves the speech quality and speaker similarity of the converted speech. In subjective and objective evaluations against the voice conversion model based on VAE and WGAN, the average MCD (Mel-Cepstral Distortion) of the converted speech decreases by 4.39%, the average MOS (Mean Opinion Score) increases by 4.46%, and the average ABX score increases by 6.70%. These results indicate that the proposed method improves both speech quality and speaker similarity.

Second, the thesis replaces the Wasserstein generative adversarial network (WGAN) in the voice conversion model based on VAE and WGAN with ACGAN, which has better generation performance. ACGAN uses the category label of each feature sample as auxiliary information, so its discriminator predicts not only whether a sample is real or fake but also which category the sample belongs to. Subjective and objective evaluations show that ACGAN works well for voice conversion: compared with the voice conversion model based on VAE and WGAN, the average MCD of the converted speech decreases by 5.98%, the average MOS increases by 6.85%, and the average ABX score increases by 8.50%, indicating that this method also improves both speech quality and speaker similarity.
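
The objective results above are reported as relative changes in MCD. As a rough illustration of what that metric measures, the sketch below computes the commonly used form of Mel-Cepstral Distortion, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. It rests on assumptions the abstract does not state: the two sequences are already time-aligned, the energy (0th) coefficient has been excluded, and the function names `mel_cepstral_distortion` and `relative_change` are hypothetical helpers, not code from the thesis.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Each argument is a list of frames; each frame is a list of
    mel-cepstral coefficients (assumed here to exclude the 0th term).
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(ref_frames)


def relative_change(baseline, proposed):
    """Percentage change of `proposed` relative to `baseline`."""
    return 100.0 * (proposed - baseline) / baseline


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the thesis.
    target = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    converted = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
    print(mel_cepstral_distortion(target, converted))
```

A lower MCD means the converted features are closer to the target speaker's features, so a negative `relative_change` against the baseline model corresponds to the "decreases by" figures quoted in the abstract.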
Keywords/Search Tags: voice conversion, variational auto-encoder, generative adversarial network, WORLD model, non-parallel corpora, many-to-many conversion, Text-Encoder, sentence embedding