
Research On Many-to-Many Voice Conversion Based On Multi-Scale StarGAN By Share-Learning For Non-parallel Corpora

Posted on: 2021-01-14 | Degree: Master | Type: Thesis
Country: China | Candidate: H Sha
GTID: 2428330614465879 | Subject: Electronic and communication engineering
Abstract/Summary:
The goal of voice conversion is to convert a source speaker's voice so that it sounds like the voice of a target speaker while the semantic content remains unchanged. Voice conversion can be divided into conversion with parallel corpora and conversion with non-parallel corpora. The difference lies in whether the speech content and duration of the source and target speakers in the training corpus are the same. In practice, a large parallel corpus is very difficult to obtain, and in some cases impossible, so studying voice conversion for non-parallel corpora is necessary. The performance of voice conversion is evaluated mainly on two aspects: the quality of the converted voice and its similarity to the target speaker. Existing non-parallel voice conversion models have difficulty performing well on both. This thesis focuses on the StarGAN voice conversion model and proposes a series of improvements addressing these two aspects.

First, to improve the sound quality of the converted voice and make it sound more realistic and refined, this thesis improves the baseline system with a Multi-Scale structure and proposes a voice conversion method based on Multi-Scale StarGAN, which extracts the target speaker's global features at different levels. The multi-scale features enhance the details of the converted voice. Subjective and objective experiments verify that, with the improved model, the time-domain waveform of the converted voice is smoother and closer to that of the target speaker, and the spectrogram is clearer. Compared with the StarGAN-based voice conversion model, the average MOS increases by 21.8% and the average ABX increases by 5.56%. The results show that this method effectively improves the synthesized voice quality while also improving voice similarity.

Second, because StarGAN trains the generator through the training of the discriminator and the classifier, a Share-Learning strategy is used to train a module shared by the discriminator and the classifier, named Share-Block. This improves the performance of the discriminator and the classifier, stabilizes training, accelerates convergence, and improves the sound quality and similarity of the synthesized voice. Subjective and objective comparisons show that, compared with the StarGAN-based voice conversion model, the average MOS increases by 15.79% and the average ABX increases by 2.38%.

Furthermore, the two contributions are combined: Share-Learning is added to the Multi-Scale StarGAN method, yielding a voice conversion method based on Multi-Scale StarGAN using Share-Learning. Subjective and objective evaluation shows that, compared with the voice converted by the Multi-Scale StarGAN method, the time-domain waveform of the converted voice is smoother and closer to that of the target speaker, and the spectrogram is clearer. The average MOS increases by 3.57% and the average ABX increases by 3.30%, indicating that this method further improves both voice quality and speaker similarity. Compared with the StarGAN-based voice conversion model, the average MOS increases by 28.95% and the average ABX increases by 9.03%. The experimental results show that this method effectively improves voice quality while also improving voice similarity.
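The MOS and ABX gains above are reported as percentage increases over a baseline. The abstract does not state how they were computed; a minimal sketch, assuming each figure is the relative change in the mean score, might look like the following. The function name `relative_gain_percent` and the listener ratings are hypothetical and are not taken from the thesis.

```python
from statistics import mean


def relative_gain_percent(baseline_scores, improved_scores):
    """Relative change of the mean score, in percent (assumed definition)."""
    base = mean(baseline_scores)
    new = mean(improved_scores)
    return round((new - base) / base * 100, 2)


# Hypothetical 5-point MOS ratings from five listeners (illustrative only).
baseline_mos = [3, 4, 3, 2, 3]   # mean 3.0
improved_mos = [4, 4, 3, 3, 4]   # mean 3.6

gain = relative_gain_percent(baseline_mos, improved_mos)
print(f"Relative MOS gain: {gain}%")
```

Under this reading, a gain measured against one baseline cannot be added directly to a gain measured against another, which is why the abstract reports the combined method against both the Multi-Scale StarGAN model and the original StarGAN model.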
Keywords/Search Tags: Voice Conversion, GAN, StarGAN, Multi-Scale, Share-Learning, Non-parallel Corpora