Research On Mandarin Text-to-Speech Based On Deep Learning

Posted on: 2022-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: F Niu
GTID: 2518306539998309
Subject: Engineering
Abstract/Summary:
Text-to-Speech (TTS), a technology that converts written text into audible speech, is a vital part of human-computer interaction. State-of-the-art end-to-end TTS systems based on Deep Neural Networks (DNNs) can already generate speech close to the human voice for languages written in the Latin alphabet, such as English. Chinese is a different story: it is written continuously, with no spaces between adjacent words, and the correlation between Chinese characters and their pronunciation is weak. As a result, when state-of-the-art DNN-based TTS technologies are applied to Mandarin, the synthesized speech has serious prosody problems, such as inappropriate pauses or sentence breaks, mispronunciation of polyphonic characters, and poor naturalness. This thesis addresses the problems that arise when DNN-based TTS technologies are applied to Mandarin text, and covers the following aspects:

1. The prosody of a Chinese sentence is closely related to its context, and the prosodic structure of Chinese resembles the syntactic structure of the sentence: a tree-like hierarchical model with prosodic components at different levels. Based on these features, this thesis proposes the Taco2-ML-BertEmb system, which integrates multi-level context features by means of a multi-level context extractor. To better match the prosodic structure of Chinese, the context extractor uses a pre-trained Chinese BERT model to extract multi-level context embeddings from inputs at different levels and incorporates them into the Tacotron2-based system. The proposed system explicitly uses the context information of the input text to achieve finer-grained prosody modeling in Mandarin TTS. The Mean Opinion Score (MOS) of speech synthesized by the proposed system is 0.74 and 0.48 points higher than that of the two benchmark models, and the Mel Cepstral Distortion (MCD) is reduced by 0.143 dB and 0.12 dB, respectively.

2. Building on the Taco2-ML-BertEmb system, this thesis constructs the ProEnh-Taco2-ML-BertEmb system to further explore the use of word stress and prominence in prosody modeling, centered on a prominence prediction network. The prosody signal is first decomposed with the Continuous Wavelet Transform (CWT) to obtain continuous prominence values for the Chinese characters. These values then guide the training of the prominence prediction network, which learns the mapping from text to prominence values. Experiments show that the ProEnh-Taco2-ML-BertEmb system generates more expressive speech while preserving naturalness.

3. Finally, this work builds a complete Mandarin speech synthesis system with a client-server (C/S) architecture, based on state-of-the-art end-to-end TTS technology. The system consists of two parts, a server and a client. Users can obtain high-quality speech easily and quickly by entering Chinese text in the client.
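
For readers unfamiliar with the Mel Cepstral Distortion (MCD) metric cited in the abstract, it is commonly defined per frame as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) and averaged over frames, where c and c' are the mel-cepstral coefficients of the reference and synthesized speech. The sketch below only illustrates that common definition and is not the evaluation code used in the thesis. It assumes the two sequences are already time-aligned (for example by dynamic time warping) and that the energy coefficient has been dropped; the function name `mel_cepstral_distortion` is hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Return the average MCD in dB over time-aligned frames.

    Assumptions (not taken from the thesis): each frame is a sequence of
    mel-cepstral coefficients with the 0th (energy) term already removed,
    and both inputs contain the same number of frames.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")

    # Constant factor of the common MCD definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_error)

    return total / len(ref_frames)


# Toy usage with made-up two-coefficient frames.
reference = [[1.0, 0.5], [0.8, 0.2]]
synthesized = [[1.0, 0.5], [0.8, 0.2]]
mcd_identical = mel_cepstral_distortion(reference, synthesized)
```

A lower value means the synthesized spectrum is closer to the reference, which is why the abstract reports a reduction in MCD as an improvement.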
Keywords/Search Tags: Speech Synthesis, Text-to-Speech, Prosody, Mandarin TTS