
Cross-language Speech Synthesis Based On Deep Learning

Posted on: 2024-08-12  Degree: Master  Type: Thesis
Country: China  Candidate: G L Zhao  Full Text: PDF
GTID: 2545307079473464  Subject: Electronic information
Abstract/Summary:
Conventional speech synthesis is now highly mature: the timbre, prosody, quality, and emotion of generated speech perform well across many domains, driven by rapid industrial adoption. As industrial applications and their requirements diversify, however, many popular voice personas are backed by recordings from only a monolingual speaker, which limits their reach in the global market. A model that can be trained to synthesize multilingual speech from a monolingual speaker's data is therefore urgently needed.

This thesis addresses the problem by proposing a cross-lingual speech synthesis model based on an end-to-end conditional variational autoencoder with adversarial generation, trained on open-source Chinese, English, and Japanese datasets. Experiments demonstrate that the proposed cross-lingual model significantly improves the naturalness of synthesized speech over the baseline model.

The thesis covers the following aspects. First, we propose an end-to-end cross-lingual speech synthesis model built on an adversarial speech generation model with end-to-end conditional variational autoencoding. On the model side, a domain adaptor based on gradient reversal is introduced: a multilingual embedding vector is added, together with a regularization loss term on the multi-speaker embedding vector, to address the insufficient disentanglement of language and speaker identity in cross-lingual synthesis. In addition, the bucket-sorting module of the original model is removed, and the stochastic duration predictor is replaced with a deterministic one, to cope with unstable rhythm and excessive speaking rate. Moreover, the effect of different phoneme and prosody representations on the cross-lingual model's synthesized speech is explored in the model's phoneme front end; experiments show that fully differentiated front-end phoneme and prosody representations yield the best synthesis results. Finally, large-scale application of cross-lingual speech synthesis is explored, and the required amount of data is determined through data-scaling training with parameter freezing.
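The gradient-reversal domain adaptor mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not the thesis's implementation; the class name `GradientReversal`, the coefficient `lam`, and the penalty function are assumptions. In such a setup the layer sits between the shared encoder and a language classifier, so gradients from the classifier are sign-flipped before reaching the encoder, pushing the encoder toward language-invariant (speaker-disentangled) features.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam
    in the backward pass (the gradient reversal layer used in
    domain-adversarial training)."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged.
        return x

    def backward(self, grad_output):
        # Flip the sign (and optionally scale) of the incoming gradient,
        # so the upstream encoder learns to *confuse* the language
        # classifier rather than help it.
        return -self.lam * grad_output


def speaker_embedding_penalty(emb, weight=1e-4):
    # A simple L2 penalty on the speaker embedding: one plausible form
    # of the regularization loss term described in the abstract
    # (assumption, not the thesis's exact loss).
    return weight * float(np.sum(emb ** 2))
```

In a full training loop, the classifier's loss gradient would be passed through `backward` before updating the encoder, while `speaker_embedding_penalty` is added to the total loss.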
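Replacing the stochastic duration predictor with a deterministic one means each phoneme receives a single predicted duration, which is then used to expand phoneme-level features into frame-level features. A minimal length-regulation sketch in the FastSpeech style (function name and shapes are assumptions, not the thesis's code):

```python
import numpy as np

def length_regulate(phoneme_feats, durations):
    """Expand phoneme-level features of shape (num_phonemes, dim) to
    frame-level features by repeating each phoneme's feature vector
    durations[i] times, where durations[i] is the predicted frame
    count for phoneme i."""
    durations = np.asarray(durations, dtype=int)
    # np.repeat with a per-row repeat count performs the expansion.
    return np.repeat(phoneme_feats, durations, axis=0)
```

With fixed durations the output rhythm is deterministic across runs, which is one way to address the unstable rhythm and excessive speaking rate noted above.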
Keywords/Search Tags:Speech Synthesis, Cross-lingual, Deep Learning, Domain Adaptation