
Research On Speech Synthesis Technology For Tibetan Lhasa Based On Fully End-to-End Method

Posted on: 2024-05-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Song
GTID: 2555307055496904
Subject: Computer technology
Abstract/Summary:
Neural network-based speech synthesis systems are usually divided into two cascaded parts, an acoustic model and a vocoder. More recent approaches use fully end-to-end models that generate the speech waveform directly from text, with linear spectrograms guiding the model as it learns the relationship between text and speech. A fully end-to-end model effectively resolves the mismatch between the output of the acoustic prediction module and the input of the vocoder module that exists in cascaded systems.

Current research on Tibetan Lhasa speech synthesis faces two challenges: the lack of a large-scale, high-quality Tibetan Lhasa corpus, and the multiple possible choices of modeling unit in the end-to-end framework. This thesis addresses these problems by investigating Tibetan multi-speaker speech synthesis based on a fully end-to-end approach and by designing and developing an online Tibetan Lhasa speech synthesis system.

Two corpora are constructed for Tibetan Lhasa. The single-speaker synthesis corpus contains 9387 sentences from a male voice, with a duration of 9.19 hours and a size of 1.66 GB. The multi-speaker synthesis corpus contains 75309 sentences from 150 speakers, with a duration of about 45 hours and a size of 15.1 GB.

To improve the naturalness, intelligibility and clarity of synthesized Tibetan Lhasa speech, this thesis studies fully end-to-end speech synthesis for Tibetan Lhasa from two angles: the choice of text input unit and the structure of the synthesis model. It then realizes Tibetan Lhasa speech synthesis from a small amount of target-speaker data and applies the result in an online Tibetan Lhasa synthesis system.

For the choice of text input unit, the thesis analyzes the structural characteristics of Tibetan text and compares the effect of two input units, phonemes and Tibetan letters, on Tibetan Lhasa synthesis performance. It concludes that, under the current speech synthesis framework, Tibetan letters are the more suitable input unit across the synthesis models tested, improving the mean opinion score (MOS) by 0.15 under the cascaded model and by 0.11 under the fully end-to-end model.

For the choice of synthesis model, the thesis compares three models: two cascaded speech synthesis models and one fully end-to-end speech synthesis model. Compared with the cascaded model, the fully end-to-end model improves the MOS by 0.12.

For applications, the thesis uses a speaker adaptation approach to design and implement Tibetan Lhasa speech synthesis from a small amount of target-speaker data. Finally, it designs and implements an online Tibetan Lhasa speech synthesis system that uses a multi-speaker adaptive model to synthesize speech matching the timbre of a specified speaker.
Keywords/Search Tags: Fully end-to-end, Tibetan Lhasa, Speech Synthesis, Deep Learning