
Nonparallel-Corpus-Based Multi-Speaker Voice Conversion

Posted on: 2021-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: H P Lin
Full Text: PDF
GTID: 2428330611999756
Subject: Computer technology
Abstract/Summary:
With the development of computer technology and breakthroughs in speech processing, speech technology now plays an important role in daily life, for example the voice assistant in in-car systems and voiceprint recognition in security systems, bringing great convenience to people's lives. Voice conversion is a pivotal task in speech synthesis: it aims to convert the non-linguistic information contained in a given utterance while keeping the linguistic content unchanged. The adoption of traditional parallel-corpus-based voice conversion is hindered by several problems: the high cost of collecting parallel corpora, the noise introduced during dynamic time warping, and the inability to efficiently build multi-speaker conversion models. Non-parallel-corpus-based multi-speaker voice conversion has therefore received growing research attention in recent years. Compared with autoencoder-based voice conversion, StarGAN-based voice conversion can explicitly model the conversion process without a parallel corpus and can model multi-speaker conversion by exploiting conditional information for intra-domain transformation. Building on the StarGAN-based voice conversion model, the main research work of this thesis is as follows.

To tackle the inability of the original ACGAN-based voice conversion model to handle multiple speakers, we propose two multi-speaker voice conversion frameworks. One is based on an auxiliary classifier generative adversarial network with an additional speaker-adversarial loss; the other is based on an activation-maximization generative adversarial network with spectral normalization. By introducing an adversarial game over speaker timbre into the generative adversarial network, the discriminator learns to capture speaker timbre information, so the model can perform multi-speaker conversion. We conduct experiments on a non-parallel multi-speaker corpus. Our proposed models outperform the current state-of-the-art model, AutoVC, which demonstrates the importance of an adversarial game over speaker timbre in a multi-speaker voice conversion GAN framework.

In this thesis, aiming to improve the similarity and quality of the converted speech, we also propose a voice conversion method with a self-attention mechanism and knowledge transfer. To improve similarity, we introduce a one-dimensional self-attention mechanism into the conversion model to capture the hierarchical structure of the frequency domain; it allows the model to observe every frequency component when making a transformation, which helps enhance the details of the generated sample. In addition, inspired by language models in natural language processing and the use of pretrained models in computer vision, we combine a voiceprint embedding with the conversion model as a more accurate speaker representation, which helps reduce overfitting and improves the model's robustness. Considering the quality of the generated audio and the need for a transferable vocoder in voice conversion, we discuss the portability of the WaveGlow vocoder and validate it experimentally: the same WaveGlow vocoder is integrated with both our English-corpus-based and Chinese-corpus-based conversion models to reconstruct the converted mel-spectrogram into speech, which helps improve the quality of the converted speech.
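The speaker-adversarial game described above can be illustrated with a toy sketch: the discriminator's auxiliary classifier is trained to recognise the source speaker's timbre in converted speech, while the generator is penalised until that speech is classified as the target speaker. This is a minimal NumPy illustration of the two opposing classification losses, not the thesis's actual implementation; the logits and speaker indices are hypothetical:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(logits, label):
    """Negative log-probability the classifier assigns to `label`."""
    return -np.log(softmax(logits)[label])

# Hypothetical speaker logits produced by the discriminator's
# auxiliary classifier for one converted utterance (3 speakers).
spk_logits = np.array([2.0, 0.5, -1.0])
src_spk, tgt_spk = 0, 2

# Discriminator side: learn to spot the true (source) timbre
# that still leaks through the converted speech.
d_spk_loss = cross_entropy(spk_logits, src_spk)

# Generator side: penalised until the converted speech is
# classified as the target speaker, i.e. until timbre converts.
g_spk_loss = cross_entropy(spk_logits, tgt_spk)
```

When the classifier still recognises the source speaker (as in the logits above), the generator's loss is large, which drives it to alter the timbre; at the equilibrium of the game, neither player can improve, and the converted speech carries the target speaker's timbre.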
Keywords/Search Tags: voice conversion, generative adversarial net, self-attention, voiceprint embedding, transferable vocoder
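The one-dimensional self-attention over frequency components mentioned in the abstract can be sketched as scaled dot-product attention applied along the frequency axis of one spectrogram frame, letting every frequency component attend to every other. The shapes (80 mel bins, 16-dimensional features) and the random projection matrices below are illustrative assumptions, not the thesis's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention_1d(x, wq, wk, wv):
    """Scaled dot-product self-attention over the frequency axis.
    x: (n_freq, d) -- each row is one frequency component's features."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # each row sums to 1
    return attn @ v, attn

n_freq, d = 80, 16                     # e.g. 80 mel bins per frame
x = rng.standard_normal((n_freq, d))   # toy frame features
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = self_attention_1d(x, wq, wk, wv)
```

Because the attention weights couple all frequency bins, the output at each bin is a mixture of information from the whole spectrum, which is how such a layer can capture cross-frequency (harmonic) structure that a purely local convolution would miss.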