Non-parallel Many-to-many Voice Conversion Method Based On PSR-STARGAN

Posted on: 2021-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: D X Xu
GTID: 2428330614963741
Subject: Signal and Information Processing
Abstract/Summary:
Voice conversion changes the personality characteristics of a speaker while preserving semantic information: the converted speech carries the same semantic content as the source speech, but its personality characteristics are transformed. Traditional voice conversion methods require parallel data to train the conversion model, and parallel data is difficult to obtain in real scenarios. To overcome this limitation, many non-parallel voice conversion methods have been proposed. Among them, methods based on generative adversarial networks have become the current mainstream, but the converted speech still suffers from low naturalness and poor similarity. This paper focuses on the voice conversion model based on STARGAN, analyzes the overall structure of the model, and proposes a series of improvements.

First, to improve the quality of the converted speech, this paper proposes a voice conversion method based on SR-STARGAN. On the one hand, ResNet can be used to address the degradation problem of deep neural networks, so it is applied to the STARGAN-VC voice conversion model by establishing a residual network between the Encoder and Decoder of the generator. This reduces the difficulty of model learning and improves the quality of the converted speech. On the other hand, STARGAN-VC uses a fixed, pre-assigned normalization method, batch normalization, and a fixed normalization method may degrade the performance of the model. This paper therefore proposes using switchable normalization (SN) to normalize the data of each layer in the model instead of the originally assigned batch normalization. SN learns to select different normalizers for different normalization layers of a deep neural network and learns their importance weights in an end-to-end manner, which allows the proposed method to achieve better performance. Subjective and objective evaluations show that, compared to STARGAN-VC, the proposed method reduces the average MCD value by 6.96%, increases the average MOS value by 9.34%, and increases the average ABX value by 5.48%. The method improves both voice quality and personality similarity.

Further, building on the improved model above, this paper proposes a voice conversion method based on PSR-STARGAN. To preserve spectral details effectively and to improve the naturalness and similarity of the converted speech, a perceptual network is used to compute a perceptual loss that measures the difference between the source and target speech spectra in a high-dimensional space. This improves the conversion performance of the model, enhances its ability to reproduce spectral details, and makes the converted speech more natural. Subjective and objective evaluations show that the proposed method reduces the average MCD value by 9.36%, increases the average MOS value by 19.29%, and increases the average ABX value by 6.32%. The method greatly improves voice quality and also improves personality similarity.
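
To illustrate how switchable normalization blends several normalizers with learned importance weights, the following is a minimal sketch in plain Python. It is not the thesis implementation: the function names, the nested-list layout `x[n][c][t]` (sample, channel, frame), and the fixed logits standing in for learned parameters are all assumptions made for illustration, and the learnable scale and shift that usually follow normalization are omitted.

```python
import math
from statistics import fmean, pvariance


def softmax(logits):
    """Turn three raw logits into importance weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switchable_norm(x, mean_logits, var_logits, eps=1e-5):
    """Normalize x[n][c][t] with a weighted mix of IN, LN and BN statistics.

    mean_logits / var_logits are hypothetical stand-ins for the parameters
    a real model would learn end-to-end; order is (instance, layer, batch).
    """
    n_samples, n_channels = len(x), len(x[0])
    w_mean = softmax(mean_logits)
    w_var = softmax(var_logits)

    # Instance statistics: one value per (sample, channel).
    mean_in = [[fmean(ch) for ch in sample] for sample in x]
    var_in = [[pvariance(ch) for ch in sample] for sample in x]

    # Layer statistics: one value per sample, over all channels and frames.
    flat_ln = [[v for ch in sample for v in ch] for sample in x]
    mean_ln = [fmean(f) for f in flat_ln]
    var_ln = [pvariance(f) for f in flat_ln]

    # Batch statistics: one value per channel, over all samples and frames.
    flat_bn = [[v for sample in x for v in sample[c]] for c in range(n_channels)]
    mean_bn = [fmean(f) for f in flat_bn]
    var_bn = [pvariance(f) for f in flat_bn]

    out = []
    for n in range(n_samples):
        sample_out = []
        for c in range(n_channels):
            # Blend the three candidate statistics with the importance weights.
            mean = (w_mean[0] * mean_in[n][c]
                    + w_mean[1] * mean_ln[n]
                    + w_mean[2] * mean_bn[c])
            var = (w_var[0] * var_in[n][c]
                   + w_var[1] * var_ln[n]
                   + w_var[2] * var_bn[c])
            scale = 1.0 / math.sqrt(var + eps)
            sample_out.append([(v - mean) * scale for v in x[n][c]])
        out.append(sample_out)
    return out


# Toy batch: 2 samples, 2 channels, 3 frames.
x = [[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
     [[4.0, 5.0, 6.0], [40.0, 50.0, 60.0]]]
y_mixed = switchable_norm(x, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

With equal logits the layer uses an even mix of the three normalizers. Pushing one logit far above the others makes the layer behave like that single normalizer, which is the sense in which each layer can "select" its own normalization.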
Keywords/Search Tags: voice conversion, generative adversarial network, WORLD, ResNet, switchable normalization, perceptual network