
Research On Any-to-many Voice Conversion Based On Non-parallel Data

Posted on: 2022-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Yang
Full Text: PDF
GTID: 2518306569465744
Subject: Electronics and Communications Engineering
Abstract/Summary:
Voice Conversion (VC) aims to reconstruct speech with new speaker information while preserving the linguistic content. It is an important branch of speech synthesis and one of the active research topics in voice interaction, with broad application prospects in fields such as medical treatment, virtual live streaming, and anti-fraud. In recent years, voice conversion based on deep learning has made great progress, and intra-lingual voice conversion can already achieve high naturalness and similarity. However, how to effectively disentangle speaker information from linguistic information, and how to alleviate cross-lingual domain mismatch under non-parallel corpora, remain key technical issues in VC. To address these problems, we study voice conversion based on deep learning. The main research work of this thesis is as follows:

(1) To effectively disentangle speaker information from linguistic information, this thesis proposes an any-to-many voice conversion method based on phoneme embedding. Most voice conversion systems based on phonetic posteriorgrams (PPGs) cannot balance naturalness and similarity on low-resource data. Our algorithm uses a disentangled phoneme-embedding linguistic representation in place of PPGs. Combined with a speaker embedding, we use pitch self-supervision to constrain the converted speech of the target speaker, and adopt multi-step output and random learning strategies to improve the context modeling and generalization ability of the voice conversion system. Experimental results show that the proposed model achieves better performance in mel-cepstral distortion, word error rate, and subjective evaluation.

(2) To address cross-lingual domain mismatch under non-parallel corpora, this thesis proposes a cross-lingual voice conversion algorithm based on time-frequency feature enhancement and speaker domain adversarial training. Most cross-lingual voice conversion algorithms are still not able to
adapt to the speaker differences caused by language mismatch, especially when the target language does not appear in the training phase, resulting in lost linguistic content or mispronunciation. The cross-lingual voice conversion method proposed in this thesis uses a mixed-language phoneme recognition model to extract universal linguistic representations (ULRs), and applies speaker domain adversarial training to better remove speaker information. It also uses speaker standardization to reconstruct the target speech more efficiently. In addition, we design an effective multi-scale time-frequency enhancement module to suppress background noise in the speech. Experimental results demonstrate that the algorithm achieves high naturalness and similarity in cross-lingual voice conversion across different languages.

The proposed voice conversion methods are verified both objectively and subjectively, and have great application value.
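The conditioning scheme described in (1) pairs a frame-level linguistic representation with an utterance-level speaker vector. A minimal sketch of this combination, assuming the decoder consumes per-frame concatenations of the two (all shapes and variable names here are illustrative, not taken from the thesis):

```python
import numpy as np

# Hypothetical dimensions: T frames of linguistic features, one speaker vector.
T, D_LING, D_SPK = 120, 256, 64

rng = np.random.default_rng(0)
phoneme_emb = rng.standard_normal((T, D_LING))  # disentangled linguistic representation (stands in for PPGs)
speaker_emb = rng.standard_normal(D_SPK)        # target-speaker embedding

# Broadcast the speaker embedding over time and concatenate per frame,
# forming a decoder input that carries both linguistic and speaker information.
decoder_input = np.concatenate(
    [phoneme_emb, np.tile(speaker_emb, (T, 1))], axis=1
)
print(decoder_input.shape)  # → (120, 320)
```

Because the speaker vector is constant across frames, the decoder can in principle swap speakers at inference time by replacing only `speaker_emb`, which is what enables the any-to-many setting.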
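The pitch self-supervision mentioned in (1) constrains the converted speech's pitch contour. One plausible form of such a constraint, sketched here as an L1 distance between log-F0 contours on mutually voiced frames (the exact loss used in the thesis is not specified; this function is a hypothetical illustration):

```python
import numpy as np

def f0_consistency_loss(f0_src, f0_conv, eps=1e-8):
    """Mean absolute log-F0 difference over frames voiced in both contours."""
    voiced = (f0_src > 0) & (f0_conv > 0)
    if not voiced.any():
        return 0.0
    return float(np.mean(np.abs(np.log(f0_src[voiced] + eps)
                                - np.log(f0_conv[voiced] + eps))))

# Toy contours in Hz; 0.0 marks unvoiced frames.
f0_src = np.array([0.0, 220.0, 230.0, 0.0, 240.0])
f0_conv = np.array([0.0, 210.0, 235.0, 0.0, 250.0])
print(f0_consistency_loss(f0_src, f0_conv))
```

Working in the log-F0 domain makes the penalty relative rather than absolute, so the same contour shape at a different register (e.g. a shifted mean pitch for the target speaker) incurs a uniform, easily normalized offset.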
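The speaker domain adversarial training in (2) pushes the linguistic encoder to produce features from which speaker identity cannot be predicted; in practice this is often realized with a gradient-reversal update. A minimal numerical sketch of the mechanism, using a single linear "encoder" and a squared-error speaker classifier (all of this is illustrative, not the thesis's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(8)       # one frame of acoustic features (hypothetical)
W = rng.standard_normal((4, 8))  # "content encoder": one linear layer for illustration
v = rng.standard_normal(4)       # speaker classifier weights
y = 1.0                          # speaker label (scalar target for a squared loss)
lam = 0.01                       # adversarial weight

def speaker_loss(W):
    s = v @ (W @ x)              # classifier score on the encoded frame
    return (s - y) ** 2

# Gradient of the speaker loss w.r.t. the encoder weights.
r = v @ (W @ x) - y
grad_W = 2.0 * r * np.outer(v, x)

# Reversed-gradient update: the encoder *ascends* the speaker loss,
# making its output less predictive of speaker identity, while the
# classifier (not shown) would descend it as usual.
W_adv = W + lam * grad_W

print(speaker_loss(W), speaker_loss(W_adv))  # the loss increases after the update
```

In a full system this sign flip is implemented once as a gradient-reversal layer between encoder and speaker classifier, so both branches train jointly with ordinary gradient descent.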
Keywords/Search Tags: voice conversion, cross-lingual, disentangled universal linguistic representation, speaker domain adversarial