
End-to-End Speech Synthesis Method For Tibetan Amdo Dialect

Posted on: 2024-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: M Luo
Full Text: PDF
GTID: 2555307124454254
Subject: Engineering
Abstract/Summary:
The development of deep learning has opened a new path for speech synthesis research. Traditional speech synthesis techniques are limited in design and implementation because they rely on complex pipelines and demand considerable expertise in linguistics and audio processing. Deep learning-based speech synthesis simplifies this pipeline, lowers the difficulty of building a synthesizer, and surpasses traditional techniques in both synthesis quality and speed.

China is a multi-ethnic country rich in ethnic languages and local dialects. Tibetans are one of China's ethnic groups, and many of them speak Tibetan as their native language. As speech synthesis technology matures, the demand for speech synthesis in ethnic languages grows with it. Research on Tibetan speech synthesis can help protect Tibetan dialects, can help bring intelligent devices to Tibetan areas, and is important for education and culture there.

At present, there is little research on speech synthesis for the Tibetan Amdo dialect, and the synthesized speech leaves much room for improvement in robustness, control of speech rate, and personalized synthesis. To address these shortcomings, this thesis uses an end-to-end deep learning framework to explore autoregressive and non-autoregressive methods for Amdo dialect speech synthesis, and builds personalized speech synthesis on the improved model. The main contributions are as follows:

1. A speech synthesis corpus for the Tibetan Amdo dialect was established. Existing Amdo dialect corpus resources were collected and organized, then expanded and improved, yielding a corpus of 19,288 sentences from 36 speakers that covers essentially all the pronunciation characteristics of the Amdo dialect.

2. Three end-to-end speech synthesis schemes for the Tibetan Amdo dialect were designed: improved and optimized Tacotron and Tacotron2 models based on the autoregressive method, and an improved and optimized FastSpeech model based on the non-autoregressive method. The experimental results show that the non-autoregressive model overcomes shortcomings of the autoregressive models, such as arbitrary word skipping, word omission, and uncontrollable speech rate, and improves the quality of Tibetan speech synthesis to a certain extent.

3. A personalized speech synthesis scheme for the Tibetan Amdo dialect based on speaker encoding is proposed. The improved autoregressive Tacotron2 model serves as the synthesizer, and audio from the target speaker is added at the encoding stage: a speaker encoder extracts the target speaker's voice features and passes them to the synthesizer, so that speech carrying the target speaker's voice characteristics can be generated for any text. Experimental results show that the speech synthesized with this scheme has some similarity with the original audio.
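The speech rate control attributed to the non-autoregressive model can be illustrated with the length regulator used in FastSpeech-style models: each phoneme's hidden vector is repeated for its predicted number of frames, and a scale factor stretches or shrinks those durations. The following is a minimal sketch under stated assumptions, not the thesis's implementation; the function name `length_regulate`, the parameter `alpha`, and the toy inputs are all hypothetical.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Expand per-phoneme vectors to per-frame vectors.

    hidden:    one feature vector per phoneme (illustrative toy data).
    durations: predicted number of frames for each phoneme.
    alpha:     duration scale (assumed name); values above 1.0 lengthen
               every phoneme (slower speech), values below 1.0 shorten it.
    """
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")
    if alpha <= 0:
        raise ValueError("alpha must be positive")

    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        # Scale the duration, round to a whole number of frames, never negative.
        repeat = max(0, int(round(dur * alpha)))
        # Copy the vector once per frame so frames do not share list objects.
        frames.extend(list(vec) for _ in range(repeat))
    return frames
```

Because the output length is fixed by the durations before any frame is generated, every phoneme is allotted an explicit number of frames, which is one reason this family of models is less prone to skipping or dropping words than attention-based autoregressive decoding.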
Keywords/Search Tags: Tibetan Amdo Dialect, Non-autoregressive, Personalized Speech Synthesis, End-to-End Models