
Research On Multi-style Text-to-speech Models

Posted on: 2020-02-07    Degree: Master    Type: Thesis
Country: China    Candidate: Z Ma    Full Text: PDF
GTID: 2428330599959613    Subject: Information and Communication Engineering
Abstract/Summary:
A text-to-speech (TTS) model generates speech from a given textual input and is an essential component of many applications such as speech-enabled devices, navigation systems, and accessibility tools for the visually impaired. Ideally, the generated speech should convey the correct message (intelligibility) while sounding like human speech (naturalness) with the right prosody (expressiveness). Most speech synthesis systems focus on the first two issues. Recently, deep learning has yielded strong results across many fields, and we have witnessed exciting developments in applying deep neural networks to TTS systems. Firstly, these systems, especially end-to-end systems, alleviate the need for heavy, laborious feature engineering and let machines automatically extract increasingly abstract and powerful features from the raw inputs. Secondly, it becomes easier to control the synthesis of speech in different styles, such as varying speed, multiple speakers, and different emotions for the same textual input, by conditioning on various attributes. Thirdly, such systems are more readily applicable to new datasets without manual data annotation or additional feature engineering. Finally, a single end-to-end system is likely to be more robust than traditional multi-stage models.

In this thesis, we focus on generating multi-style speech using deep neural networks. Our contributions are two-fold. Firstly, to cover a wide range of richness and expressiveness in speaking styles, we introduce two datasets, each consisting of pairs of audio clips and their corresponding textual transcriptions, collected from bilingual animated movies. Secondly, we design two models, a multi-style TTS model and a cross-linguistic, multi-style TTS model, that automatically learn the rich styles underlying the clipped audio; each model can be trained entirely from scratch with random initialization. In the experimental part of the thesis, because the proposed dataset contains background noise, we adopt training strategies that stabilize and facilitate the training process, and we conduct a series of experiments to explore and interpret the learned models.
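The abstract does not specify the architecture, so the following is only a minimal sketch of the general idea of style conditioning in a sequence-to-sequence TTS encoder: a learned embedding for a discrete style label (e.g., a speaker or emotion id) is broadcast and concatenated to the text-encoder states, so the decoder can produce different prosody for the same input text. All module names, dimensions, and the use of PyTorch are illustrative assumptions, not the thesis's implementation.

```python
# Illustrative sketch only: style-conditioned TTS text encoder (not the thesis model).
import torch
import torch.nn as nn


class StyleConditionedEncoder(nn.Module):
    def __init__(self, vocab_size=80, num_styles=8, text_dim=256, style_dim=64):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, text_dim)
        self.text_encoder = nn.GRU(text_dim, text_dim, batch_first=True,
                                   bidirectional=True)
        # One embedding per discrete style label (assumed: speaker or emotion id).
        self.style_embedding = nn.Embedding(num_styles, style_dim)

    def forward(self, char_ids, style_ids):
        # char_ids: (batch, time), style_ids: (batch,)
        text_states, _ = self.text_encoder(self.char_embedding(char_ids))
        style = self.style_embedding(style_ids)                 # (batch, style_dim)
        style = style.unsqueeze(1).expand(-1, text_states.size(1), -1)
        # The decoder would attend over states carrying both content and style.
        return torch.cat([text_states, style], dim=-1)


if __name__ == "__main__":
    enc = StyleConditionedEncoder()
    chars = torch.randint(0, 80, (2, 15))   # two toy utterances
    styles = torch.tensor([0, 3])           # two different style labels
    print(enc(chars, styles).shape)         # torch.Size([2, 15, 576])
```

With this kind of conditioning, changing only `style_ids` at inference time alters the synthesized style while the text content stays fixed, which is the behavior the thesis describes for its multi-style models.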
Keywords/Search Tags:Text-to-Speech, Cross-linguistic TTS, Multi-style TTS, Deep Neural Networks