
Research On Voice Conversion System Based On Vector Quantized Variational Autoencoder

Posted on: 2022-06-05    Degree: Master    Type: Thesis
Country: China    Candidate: X Kang    Full Text: PDF
GTID: 2518306539998269    Subject: Engineering
Abstract/Summary:
Voice conversion (VC) aims to change the speech of a source speaker into speech that resembles a target speaker, without altering the linguistic content. To imitate the target speaker, a VC system must modify the timbre of the source speaker so that the converted speech sounds more like the target speaker. Viewing VC as style transfer, different speakers are treated as different style domains: during conversion, the spoken content is preserved while the style is transferred. In the style-transfer literature, many models build on an encoder-decoder framework that disentangles content and speaker representations from the acoustic features and then concatenates the content representations with new speaker representations to achieve VC.

Building on this idea of disentangling content and speaker information, this thesis proposes two extensions to a VC system based on the Vector Quantized Variational Autoencoder (VQ-VAE): introducing Connectionist Temporal Classification (CTC) supervision, and introducing a Self-supervised Audio Transformers (SAT) model. VQ-VAE-based VC disentangles content and speaker representations from speech using a content encoder and a speaker encoder, respectively. However, the converted speech is often unsatisfactory: because the content encoder lacks linguistic supervision, it is difficult to disentangle pure content representations from the acoustic features. To address this issue, a CTC loss computed by an auxiliary network is proposed, within the VQ-VAE framework, to guide the content encoder toward pure content representations. Because the CTC loss is not affected by the sequence length of the content encoder's output, it adds linguistic supervision to the content encoder easily and makes modeling simpler.

Unlike state-of-the-art VC systems, which typically take the log Mel-spectrogram (LMS) as input, we propose high-level acoustic features extracted from an SAT-based model trained on LMS, termed SAT-LMS, which aims to provide high-level acoustic information for VC. The proposed SAT-LMS features are then used as input to the parallel and non-parallel VC systems, respectively. We conduct multiple experiments on the two proposed methods using both parallel and non-parallel corpora. Objective and subjective evaluations show that the proposed methods perform well in both speech quality and speaker similarity, confirming their effectiveness.
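The disentangle-then-recombine pipeline described above can be illustrated with a minimal sketch: a nearest-neighbour vector-quantization step (the core of the VQ-VAE content encoder) followed by concatenation of the quantized content codes with a target-speaker embedding to form the decoder input. The codebook size, feature dimensions, and function names below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each content-encoder frame to its nearest codebook vector.

    z_e:      (T, D) continuous content-encoder outputs
    codebook: (K, D) learned discrete codebook
    Returns the quantized frames (T, D) and their code indices (T,).
    """
    # Squared Euclidean distance from every frame to every code
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (T,) index of nearest code per frame
    z_q = codebook[indices]          # (T, D) quantized content representation
    return z_q, indices

def convert(z_e, codebook, target_spk_emb):
    """Concatenate quantized content codes with a target speaker embedding,
    forming the decoder input for voice conversion."""
    z_q, _ = vector_quantize(z_e, codebook)
    # Repeat the utterance-level speaker embedding for every frame
    spk = np.broadcast_to(target_spk_emb,
                          (z_q.shape[0], target_spk_emb.shape[0]))
    return np.concatenate([z_q, spk], axis=1)   # (T, D + D_spk)

rng = np.random.default_rng(0)
z_e = rng.normal(size=(50, 64))        # 50 frames of 64-dim content features
codebook = rng.normal(size=(320, 64))  # 320 discrete codes
spk_emb = rng.normal(size=(16,))       # 16-dim target speaker embedding
decoder_input = convert(z_e, codebook, spk_emb)
print(decoder_input.shape)  # (50, 80)
```

In a full system the decoder would map this concatenated sequence back to acoustic features; the CTC supervision described above would be an auxiliary loss on the content codes, not part of this inference-time path.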
Keywords/Search Tags:Voice Conversion, Vector Quantized Variational Autoencoder, Linguistic Supervision, Connectionist Temporal Classification, Self-supervised Audio Transformers