
Research On Voice Conversion System Based On Vector Quantized Variational Autoencoder

Posted on: 2022-06-05    Degree: Master    Type: Thesis
Country: China    Candidate: X Kang    Full Text: PDF
GTID: 2518306539998269    Subject: Engineering
Abstract/Summary:
Voice conversion (VC) aims to change the speech of a source speaker into speech that resembles a target speaker, without altering the linguistic content. To imitate the target speaker, a VC system must modify the timbre of the source speaker so that the converted speech sounds more like the target speaker. Viewing VC as style transfer, different speakers are treated as different style domains: during conversion, the spoken content is preserved while the style is transferred. In the style-transfer literature, many models build on an encoder-decoder framework that disentangles content and speaker representations from the acoustic features and then concatenates the content representations with new speaker representations to achieve VC.

Building on this idea of disentangling content and speaker information, this thesis proposes two extensions to a VC system based on the Vector Quantized Variational Autoencoder (VQ-VAE): introducing Connectionist Temporal Classification (CTC) supervision, and introducing a Self-supervised Audio Transformers (SAT) model. VQ-VAE-based VC disentangles content and speaker representations from speech using a content encoder and a speaker encoder, respectively. However, the converted speech is often unsatisfactory: because the content encoder lacks linguistic supervision, it is difficult to disentangle pure content representations from the acoustic features. To address this issue, a CTC loss computed by an auxiliary network is proposed, within the VQ-VAE framework, to guide the content encoder toward pure content representations. Because the CTC loss is not affected by the sequence length of the content encoder's output, it adds linguistic supervision to the content encoder easily and makes modeling simpler.

Unlike state-of-the-art VC systems, which typically take the log Mel-spectrogram (LMS) as input, we propose high-level acoustic features extracted from an SAT-based model trained on LMS, termed SAT-LMS, which aims to provide high-level acoustic information for VC. The proposed SAT-LMS features are then used as input to the parallel and non-parallel VC systems, respectively. We conduct multiple experiments on the two proposed methods using both parallel and non-parallel corpora. Objective and subjective evaluations show that the proposed methods perform well in both speech quality and speaker similarity, confirming their effectiveness.
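The disentangle-then-recombine pipeline described above can be illustrated with a minimal sketch: a nearest-neighbour vector-quantization step (the core of the VQ-VAE content encoder) followed by concatenation of the quantized content codes with a target-speaker embedding to form the decoder input. The codebook size, feature dimensions, and function names below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each content-encoder frame to its nearest codebook vector.

    z_e:      (T, D) continuous content-encoder outputs
    codebook: (K, D) learned discrete codebook
    Returns the quantized frames (T, D) and their code indices (T,).
    """
    # Squared Euclidean distance from every frame to every code
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (T,) index of nearest code per frame
    z_q = codebook[indices]          # (T, D) quantized content representation
    return z_q, indices

def convert(z_e, codebook, target_spk_emb):
    """Concatenate quantized content codes with a target speaker embedding,
    forming the decoder input for voice conversion."""
    z_q, _ = vector_quantize(z_e, codebook)
    # Repeat the utterance-level speaker embedding for every frame
    spk = np.broadcast_to(target_spk_emb,
                          (z_q.shape[0], target_spk_emb.shape[0]))
    return np.concatenate([z_q, spk], axis=1)   # (T, D + D_spk)

rng = np.random.default_rng(0)
z_e = rng.normal(size=(50, 64))        # 50 frames of 64-dim content features
codebook = rng.normal(size=(320, 64))  # 320 discrete codes
spk_emb = rng.normal(size=(16,))       # 16-dim target speaker embedding
decoder_input = convert(z_e, codebook, spk_emb)
print(decoder_input.shape)  # (50, 80)
```

In a full system the decoder would map this concatenated sequence back to acoustic features; the CTC supervision described above would be an auxiliary loss on the content codes, not part of this inference-time path.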
Keywords/Search Tags:Voice Conversion, Vector Quantized Variational Autoencoder, Linguistic Supervision, Connectionist Temporal Classification, Self-supervised Audio Transformers