Mongolian Speech Synthesis Based On Deep Learning

Posted on: 2021-04-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Liu
Full Text: PDF
GTID: 1368330620476644
Subject: Computer application technology
Abstract/Summary:
The main task of speech synthesis, or Text-to-Speech (TTS), is to map an input text to a speech waveform. It draws on acoustics, linguistics, digital signal processing, computer science, and other disciplines. The technology is widely applicable to smart homes, virtual anchors, voice navigation, information broadcasting, education, pan-entertainment, and other fields, and it plays a significant role in human-computer interaction.

Recently, researchers have shown increasing interest in applying deep learning to Mongolian intelligent information processing, and the strong performance of deep learning models has brought significant improvements to Mongolian TTS. However, compared with mainstream languages such as Chinese and English, current Mongolian TTS is not yet mature, and further research is needed before the quality of synthesized speech meets practical requirements. Compared with natural speech, the speech generated by Mongolian TTS systems still lacks naturalness and expressiveness because of its poor prosody and audio quality. The main reason is that both the prosody model and the acoustic model have limited modeling capability. To address these issues, this dissertation proposes deep learning-based methods for both components. For the prosody model, we make full use of Mongolian linguistic knowledge and apply multi-task learning. For the acoustic model, we apply a knowledge distillation strategy and explicit prosodic knowledge to improve the end-to-end acoustic model.

The innovations and main contributions of this dissertation are summarized as follows:

1. We propose combining Mongolian morphological and phonological knowledge to model Mongolian prosodic structure. To improve the prosodic performance of the Mongolian TTS model, this work proposes two prosody modeling methods that incorporate such knowledge. The first method, the "morpheme units based Mongolian prosody model", transforms Mongolian words into morpheme units and then uses these units to predict the Mongolian prosodic structure. The second method, the "Mongolian prosody model using morphological and phonological embeddings", takes Mongolian word embeddings, morphological embeddings, and phonological embeddings as a joint input to improve the accuracy of the prosody model. Experimental results show that both methods improve the prosodic performance of the Mongolian TTS model.

2. We propose a Mongolian prosody modeling method based on multi-task learning. Mongolian prosody modeling and Mongolian grapheme-to-phoneme (G2P) conversion are naturally correlated. The traditional Mongolian prosody model does not consider the relationship between these two tasks and therefore ignores information from the related task. To solve this problem, the method uses a multi-task learning mechanism to integrate the Mongolian prosody model and the Mongolian G2P task into a unified training framework. Joint training of the two tasks improves the accuracy of Mongolian prosody modeling. Experimental results show that this method effectively improves the naturalness of the Mongolian TTS system.

3. We propose a robust end-to-end acoustic model trained with a knowledge distillation strategy. To address the exposure bias problem caused by autoregressive decoding in the decoder of the end-to-end acoustic model, the method trains the acoustic model in a "teacher-student" framework. We first train the teacher model, which uses natural speech parameters as the decoder input (the "teacher-forcing" decoding mode). We then train the student model, which takes the speech parameters estimated at the previous time step as the decoder input (the "free-running" decoding mode). During student training, the student model learns the teacher model's decoder hidden states and the natural speech parameter distribution at the same time through knowledge distillation. Experiments show that this method makes the end-to-end acoustic model more stable and robust and alleviates problems such as skipped, missing, and repeated words during synthesis.

4. We propose an acoustic modeling method guided by explicit prosodic information. The end-to-end acoustic model is designed to learn the <text, speech> mapping, but prosody is modeled only implicitly. The model therefore lacks the guidance of explicit prosodic information during training, which may limit its prosodic performance. The method integrates prosodic information into the acoustic model through a feature-level strategy and a model-level strategy. In the feature-level strategy, a pre-trained prosody generator produces prosody embeddings, which are combined with the character embeddings and fed to the text encoder and acoustic decoder. In the model-level strategy, a prosody generator produces the prosody embeddings and the text encoder produces high-level character embeddings; the two are then concatenated into a single feature representation that is fed to the acoustic decoder. Note that in the model-level strategy the prosody generator is trained jointly with the acoustic model. Experimental results show that both strategies effectively improve the overall performance of the end-to-end Mongolian TTS model.

In summary, the methods proposed in this dissertation are feasible ways to improve the prosody model and the acoustic model and to help the Mongolian TTS system meet practical requirements. They also shed new light on future speech synthesis research for other agglutinative languages. In addition, the work in this dissertation contributes to the advancement of Mongolian intelligent information processing and to the development of artificial intelligence technology in the minority areas of China.
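
The teacher-student training described in contribution 3 can be illustrated with a minimal sketch. The abstract does not give the loss function, the network architecture, or any weighting, so everything below is an assumption made for illustration: the mean-squared-error terms, the `distill_weight` parameter, the names `decode`, `student_loss`, and `toy_step`, and the toy decoder step (which is shared by teacher and student here only to keep the example short; in the dissertation they are separately trained models).

```python
from typing import Callable, List, Tuple

Vector = List[float]
# Hypothetical decoder step: (previous frame, previous hidden) -> (frame, hidden).
DecoderStep = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def mse(a: List[Vector], b: List[Vector]) -> float:
    """Mean squared error over two equally shaped sequences of vectors."""
    total, count = 0.0, 0
    for va, vb in zip(a, b):
        for x, y in zip(va, vb):
            total += (x - y) ** 2
            count += 1
    return total / count


def decode(step: DecoderStep, natural_frames: List[Vector],
           init_hidden: Vector, teacher_forcing: bool):
    """Run an autoregressive decoder for len(natural_frames) steps.

    teacher_forcing=True  feeds the natural frame from the previous step.
    teacher_forcing=False feeds the decoder's own previous output
    (the "free-running" mode).
    """
    prev_frame = [0.0] * len(natural_frames[0])  # all-zero start frame
    hidden = list(init_hidden)
    frames, hiddens = [], []
    for t in range(len(natural_frames)):
        frame, hidden = step(prev_frame, hidden)
        frames.append(frame)
        hiddens.append(hidden)
        prev_frame = natural_frames[t] if teacher_forcing else frame
    return frames, hiddens


def student_loss(student_frames, student_hiddens, teacher_hiddens,
                 natural_frames, distill_weight: float = 1.0) -> float:
    """Assumed objective: match natural speech parameters and, at the same
    time, the teacher's decoder hidden states."""
    reconstruction = mse(student_frames, natural_frames)
    distillation = mse(student_hiddens, teacher_hiddens)
    return reconstruction + distill_weight * distillation


def toy_step(prev_frame: Vector, hidden: Vector) -> Tuple[Vector, Vector]:
    """Toy stand-in for a decoder cell: averages hidden state and input."""
    new_hidden = [0.5 * h + 0.5 * x for h, x in zip(hidden, prev_frame)]
    return list(new_hidden), new_hidden


natural = [[1.0], [1.0], [1.0]]
teacher_frames, teacher_hiddens = decode(toy_step, natural, [0.0], True)
student_frames, student_hiddens = decode(toy_step, natural, [0.0], False)
loss = student_loss(student_frames, student_hiddens, teacher_hiddens, natural)
```

In this toy run the free-running decoder never sees the natural frames and drifts away from the teacher-forced trajectory, which is the mismatch that exposure bias refers to. The second loss term penalizes that drift in hidden-state space; how the two terms are actually balanced in the dissertation is not stated in the abstract.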
Keywords/Search Tags:Mongolian, Speech synthesis, Deep Learning, Prosody Modelling, Acoustic Modelling