Non-parallel Voice Conversion Using ACGAN And Variational Autoencoders Conditioned By Sentence Embedding

Posted on:2020-09-23

Degree:Master

Type:Thesis

Country:China

Candidate:Y Shi

GTID:2428330590995535

Subject:Signal and Information Processing

Abstract/Summary:

Voice conversion is a technique that transforms the speaker identity of a source utterance into that of a target speaker while preserving the linguistic content of the source speech. This thesis addresses the over-regularization of latent variables in the VAWGAN voice conversion model by introducing sentence embeddings produced by a Text-encoder. It also improves the GAN (generative adversarial network) component by adopting the auxiliary classifier GAN (ACGAN). Both changes improve the speech quality and speaker similarity of the converted speech.

First, the thesis applies sentence embeddings trained by the Text-encoder to the voice conversion model based on VAE and WGAN. The semantic information carried by the sentence embedding improves the speech quality and speaker similarity of the converted speech. Subjective and objective evaluations show that, compared with the VAE/WGAN model, the average MCD (Mel-Cepstral Distortion) of the converted speech decreases by 4.39%, the average MOS (Mean Opinion Score) increases by 4.46%, and the average ABX score increases by 6.70%. These results indicate that the proposed method substantially improves speech quality and speaker similarity.

Second, the thesis replaces the Wasserstein GAN in the VAE/WGAN model with ACGAN, which has better generation performance. ACGAN uses the category label of each feature sample as auxiliary information, so its discriminator predicts not only whether a sample is real or generated but also which category the sample belongs to. Subjective and objective evaluations show that ACGAN works well for voice conversion: compared with the VAE/WGAN model, the average MCD of the converted speech decreases by 5.98%, the average MOS increases by 6.85%, and the average ABX score increases by 8.50%, indicating that this method substantially improves speech quality and speaker similarity.
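To make the two-headed ACGAN discriminator concrete, the following is a minimal NumPy sketch of the objective from the original ACGAN formulation (Odena et al., 2017). It includes a source term L_S, the log-likelihood of the correct real/generated decision, and a class term L_C, the log-likelihood of the correct category. The discriminator maximizes L_S + L_C, while the generator maximizes L_C - L_S. The sketch assumes the discriminator already outputs one real/fake logit and a vector of class (here, speaker) logits per sample. The function name `acgan_losses` and the array layout are hypothetical and do not come from the thesis.

```python
import numpy as np

def log_sigmoid(x):
    # log(sigmoid(x)) computed stably as -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def log_softmax(logits):
    # numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def acgan_losses(src_real, cls_real, src_fake, cls_fake, labels_real, labels_fake):
    """Hypothetical helper: return (discriminator_loss, generator_loss).

    src_*    : real/fake logits, shape (batch,)
    cls_*    : class (speaker) logits, shape (batch, n_classes)
    labels_* : integer class labels, shape (batch,)
    """
    idx_r = np.arange(len(labels_real))
    idx_f = np.arange(len(labels_fake))
    # L_S: log-likelihood of the correct source (real -> 1, generated -> 0);
    # log(1 - sigmoid(x)) equals log(sigmoid(-x))
    l_s = log_sigmoid(src_real).mean() + log_sigmoid(-src_fake).mean()
    # L_C: log-likelihood of the correct class for real and generated samples
    l_c = (log_softmax(cls_real)[idx_r, labels_real].mean()
           + log_softmax(cls_fake)[idx_f, labels_fake].mean())
    # D maximizes L_S + L_C and G maximizes L_C - L_S; both are negated
    # here so that each can be minimized
    return float(-(l_s + l_c)), float(-(l_c - l_s))

# Usage with an uninformative discriminator (all logits zero, 2 classes)
d_loss, g_loss = acgan_losses(np.zeros(3), np.zeros((3, 2)),
                              np.zeros(3), np.zeros((3, 2)),
                              np.array([0, 1, 0]), np.array([1, 1, 0]))
```

The class term is what distinguishes ACGAN from a plain GAN. Because the generator is rewarded when generated samples are classified as the intended category, the labels steer generation directly, rather than serving only as extra input.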
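MCD, the objective metric used in the evaluations, is conventionally computed for each frame as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) over the mel-cepstral coefficients and then averaged over frames. The sketch below follows that definition under stated assumptions, since the abstract does not give the thesis's exact settings. It assumes the reference and converted sequences are already time-aligned, for example by dynamic time warping. It also excludes the 0th (energy) coefficient, which is a common but not universal convention. The function name `mel_cepstral_distortion` is hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(ref, conv, skip_c0=True):
    """Hypothetical helper: average MCD in dB between two aligned sequences.

    ref, conv: arrays of shape (frames, order); frames are assumed to be
    time-aligned beforehand (e.g. by DTW).
    """
    ref = np.asarray(ref, dtype=float)
    conv = np.asarray(conv, dtype=float)
    if ref.shape != conv.shape:
        raise ValueError("sequences must be time-aligned and have the same shape")
    if skip_c0:
        # the 0th coefficient mainly carries frame energy and is commonly excluded
        ref, conv = ref[:, 1:], conv[:, 1:]
    diff_sq = np.sum((ref - conv) ** 2, axis=1)           # per-frame squared distance
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * diff_sq)
    return float(per_frame.mean())                        # average over frames, in dB
```

A lower MCD means the converted spectra are closer to the target. The 4.39% and 5.98% figures in the abstract are relative decreases in this frame-averaged value.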

Keywords/Search Tags:

voice conversion, variational auto-encoder, generative adversarial network, WORLD model, non-parallel corpora, many-to-many conversion, Text-Encoder, sentence embedding

Related items

1	Research On Many-to-Many Voice Conversion Based On I-vector,Variational Auto-encoder And Generative Adversarial Networks For Non-parallel Corpora
2	Research On Many To Many Voice Conversion Based On I-vector And Improved Variational Autoencoder For Non-parallel Corpora
3	High-quality Voice Conversion From Non-parallel Corpora Based On Variational Auto-encoder And Bottleneck Feature
4	Many-to-Many Voice Conversion Algorithm Based On Dense Net Star Generative Adversarial Network Combining I-vector For Non-parallel Corpora
5	Research On Any-to-any Emotional Voice Conversion Based On Variational Auto-encoder
6	StyleGAN Voice Conversion Combining DSNet And ESR Network
7	Non-parallel Many-to-Many Voice Conversion Based On SE-ResNet Combining Speaker Embedding
8	Non-parallel Many-to-many Voice Conversion Method Based On PSR-STARGAN
9	Research Image Style Transfer Algorithm Based On Generative Adversarial Networks
10	Research On Speech Conversion Algorithms Based On Deep Convolutional Auto Encoder