
Research On Multi-style Text-to-speech Models

Posted on: 2020-02-07    Degree: Master    Type: Thesis
Country: China    Candidate: Z Ma    Full Text: PDF
GTID: 2428330599959613    Subject: Information and Communication Engineering
Abstract/Summary:
A text-to-speech (TTS) model generates speech from a given textual input and is an essential component of many applications such as speech-enabled devices, navigation systems, and accessibility tools for the visually impaired. Ideally, the generated speech should convey the correct message (intelligibility) while sounding like human speech (naturalness) with the right prosody (expressiveness). Most speech synthesis systems focus on the first two issues. Recently, deep learning has yielded strong results across many fields, and we have witnessed exciting developments in applying deep neural networks to TTS systems. Firstly, these systems, especially end-to-end systems, alleviate the need for heavy, laborious feature engineering and let machines automatically extract increasingly abstract and powerful features from the raw inputs. Secondly, it becomes easier to control the synthesis of speech in different styles, such as varying speed, multiple speakers, and different emotions for the same textual input, by conditioning on various attributes. Thirdly, such systems are more readily applicable to new datasets without manual data annotation or additional feature engineering. Finally, a single end-to-end system is likely to be more robust than traditional multi-stage models.

In this thesis, we focus on generating multi-style speech using deep neural networks. Our contributions are two-fold. Firstly, to cover a wide range of richness and expressiveness in speaking styles, we introduce two datasets, each consisting of pairs of audio clips and their corresponding textual transcriptions, collected from bilingual animated movies. Secondly, we design two models, a multi-style TTS model and a cross-linguistic, multi-style TTS model, that automatically learn the rich styles underlying the clipped audio; each model can be trained entirely from scratch with random initialization. In the experimental part of the thesis, because the proposed dataset contains background noise, we adopt training strategies that stabilize and facilitate the training process, and we conduct a series of experiments to explore and interpret the learned models.
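The abstract does not specify the architecture, so the following is only a minimal sketch of the general idea of style conditioning in a sequence-to-sequence TTS encoder: a learned embedding for a discrete style label (e.g., a speaker or emotion id) is broadcast and concatenated to the text-encoder states, so the decoder can produce different prosody for the same input text. All module names, dimensions, and the use of PyTorch are illustrative assumptions, not the thesis's implementation.

```python
# Illustrative sketch only: style-conditioned TTS text encoder (not the thesis model).
import torch
import torch.nn as nn


class StyleConditionedEncoder(nn.Module):
    def __init__(self, vocab_size=80, num_styles=8, text_dim=256, style_dim=64):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, text_dim)
        self.text_encoder = nn.GRU(text_dim, text_dim, batch_first=True,
                                   bidirectional=True)
        # One embedding per discrete style label (assumed: speaker or emotion id).
        self.style_embedding = nn.Embedding(num_styles, style_dim)

    def forward(self, char_ids, style_ids):
        # char_ids: (batch, time), style_ids: (batch,)
        text_states, _ = self.text_encoder(self.char_embedding(char_ids))
        style = self.style_embedding(style_ids)                 # (batch, style_dim)
        style = style.unsqueeze(1).expand(-1, text_states.size(1), -1)
        # The decoder would attend over states carrying both content and style.
        return torch.cat([text_states, style], dim=-1)


if __name__ == "__main__":
    enc = StyleConditionedEncoder()
    chars = torch.randint(0, 80, (2, 15))   # two toy utterances
    styles = torch.tensor([0, 3])           # two different style labels
    print(enc(chars, styles).shape)         # torch.Size([2, 15, 576])
```

With this kind of conditioning, changing only `style_ids` at inference time alters the synthesized style while the text content stays fixed, which is the behavior the thesis describes for its multi-style models.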
Keywords/Search Tags:Text-to-Speech, Cross-linguistic TTS, Multi-style TTS, Deep Neural Networks