Non-parallel Many-to-many Voice Conversion Method Based On PSR-STARGAN

Posted on:2021-03-11

Degree:Master

Type:Thesis

Country:China

Candidate:D X Xu

GTID:2428330614963741

Subject:Signal and Information Processing

Abstract/Summary:

Voice conversion is the task of transforming a speaker's personality characteristics while preserving semantic information: the converted speech carries the same semantic content as the source speech, but its personality characteristics are changed. Traditional voice conversion methods require parallel data to train the conversion model, but parallel data is difficult to obtain in real scenarios. To overcome this limitation, many non-parallel voice conversion methods have been proposed, among which methods based on generative adversarial networks have become the current mainstream. However, the converted speech still suffers from low naturalness and poor similarity. This thesis focuses on the voice conversion model based on STARGAN, analyzes the overall structure of the model, and proposes a series of improvements.

First, to improve the quality of the converted speech, this thesis proposes a voice conversion method based on SR-STARGAN. On the one hand, because ResNet can address the degradation problem of deep neural networks, it is applied to the STARGAN-VC model by establishing a residual network between the encoder and decoder of the generator, which reduces the difficulty of model learning and improves the quality of the converted speech. On the other hand, STARGAN-VC uses a fixed, manually assigned normalizer, batch normalization, in every normalization layer. A single fixed normalization method may degrade model performance, so this thesis proposes replacing the assigned batch normalization with switchable normalization (SN) to normalize the data of each layer in the model. SN learns to select different normalizers for different normalization layers of a deep neural network and learns their importance weights in an end-to-end manner, which helps the proposed method achieve better performance. Subjective and objective evaluations show that, compared with STARGAN-VC, the proposed method reduces the average MCD value by 6.96%, increases the average MOS value by 9.34%, and increases the average ABX value by 5.48%. The method improves both voice quality and personality similarity.

Further, building on the improved model above, this thesis proposes a voice conversion method based on PSR-STARGAN. To preserve spectral details effectively and improve the naturalness and similarity of the converted speech, a perceptual network is used to extract a perceptual loss that measures the difference between the source and target speech spectra in a high-dimensional feature space. This improves the conversion performance of the model, enhances its ability to reproduce spectral details, and makes the converted speech more natural. Subjective and objective evaluations show that the proposed method reduces the average MCD value by 9.36%, increases the average MOS value by 19.29%, and increases the average ABX value by 6.32%. The method greatly improves voice quality and also improves personality similarity.
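The switchable normalization idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes features shaped [N][C][L] (batch, channels, frames) with equal-length channels, and it uses fixed logits in place of the importance weights that a real model would learn by backpropagation. The names `softmax` and `switchable_norm` and all parameters are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw importance logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switchable_norm(x, mean_logits, var_logits, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize x ([N][C][L] nested lists) with a weighted mix of
    instance (IN), layer (LN) and batch (BN) statistics.

    Assumption: logits are ordered (IN, LN, BN) and are fixed here;
    in a trained network they would be learnable parameters.
    """
    n_s, n_c = len(x), len(x[0])
    # IN statistics: one mean and variance per (sample, channel).
    mean_in = [[sum(ch) / len(ch) for ch in s] for s in x]
    var_in = [[sum((v - mean_in[n][c]) ** 2 for v in x[n][c]) / len(x[n][c])
               for c in range(n_c)] for n in range(n_s)]
    # Second moments, so LN and BN statistics can be derived from IN ones.
    sq = [[var_in[n][c] + mean_in[n][c] ** 2 for c in range(n_c)]
          for n in range(n_s)]
    # LN statistics: per sample, pooled over channels.
    mean_ln = [sum(mean_in[n]) / n_c for n in range(n_s)]
    var_ln = [sum(sq[n]) / n_c - mean_ln[n] ** 2 for n in range(n_s)]
    # BN statistics: per channel, pooled over the batch.
    mean_bn = [sum(mean_in[n][c] for n in range(n_s)) / n_s for c in range(n_c)]
    var_bn = [sum(sq[n][c] for n in range(n_s)) / n_s - mean_bn[c] ** 2
              for c in range(n_c)]

    wm, wv = softmax(mean_logits), softmax(var_logits)
    out = []
    for n in range(n_s):
        sample = []
        for c in range(n_c):
            mu = wm[0] * mean_in[n][c] + wm[1] * mean_ln[n] + wm[2] * mean_bn[c]
            var = wv[0] * var_in[n][c] + wv[1] * var_ln[n] + wv[2] * var_bn[c]
            scale = gamma / math.sqrt(var + eps)
            sample.append([(v - mu) * scale + beta for v in x[n][c]])
        out.append(sample)
    return out


# Toy input: 2 utterances, 2 channels, 3 frames each (made-up values).
features = [[[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]],
            [[0.0, 10.0, 20.0], [5.0, 5.5, 6.0]]]
# Equal logits give an even blend of the three normalizers.
blended = switchable_norm(features, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

Pushing one logit far above the others makes the layer behave like a single normalizer, which is the sense in which each layer can "select" its own normalization. Deriving the LN and BN statistics from the IN statistics avoids three separate passes over the data.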

Keywords/Search Tags:

voice conversion, Generative Adversarial Network, WORLD, ResNet, switchable normalization, perceptual network

Related items

1	StyleGAN Voice Conversion Combining DSNet And ESR Network
2	Many-to-Many Voice Conversion Algorithm Based On Dense Net Star Generative Adversarial Network Combining I-vector For Non-parallel Corpora
3	Residual Generative Adversarial Network Algorithm Research
4	Non-parallel Many-to-many Voice Conversion Based On Dynamic Convolution StyleGAN
5	Research On Many-to-Many Voice Conversion Based On I-vector, Variational Auto-encoder And Generative Adversarial Networks For Non-parallel Corpora
6	Non-parallel Many-to-Many Voice Conversion Based On SE-ResNet Combining Speaker Embedding
7	A New Lipschitz Generative Adversarial Network And Its Application In Voice Conversion
8	Research On Image Style Conversion Method Based On Generative Adversarial Network
9	Research On Zero-Shot Voice Conversion With Generative Adversarial Networks
10	Non-parallel Voice Conversion Using ACGAN And Variational Autoencoders Conditioned By Sentence Embedding