
Emotional Speech Synthesis Based On Neural Network

Posted on: 2022-08-31    Degree: Master    Type: Thesis
Country: China    Candidate: Z B Dai    Full Text: PDF
GTID: 2518306560955639    Subject: Computer technology
Abstract/Summary:
As the output end of human-machine speech interaction, the quality of speech synthesis directly shapes the user experience. A high-quality, stable speech synthesis system makes the machine more anthropomorphic and the interaction process more natural. Many strong TTS models have been proposed to improve the quality of neutral speech, such as Tacotron2 and WaveNet, but most of them use an RNN or LSTM as encoder and decoder, and this autoregressive structure makes both training and inference slow. In addition, as intelligent speech synthesis systems continue to improve, the demand for more natural speech keeps growing. In recent years, the analysis and synthesis of emotional speech has become a new research hotspot, and more and more researchers are working on synthesizing expressive emotional speech. In this field, however, open-source emotional datasets are scarce, and most data come from different speakers, so the datasets available for training are small, which to some extent limits the effectiveness of emotional speech synthesis models based on deep learning. In response to these problems, the main work of this thesis is as follows:

(1) To address the inefficient training and inference of RNN-based neural speech synthesis models and their loss of long-distance information, an end-to-end BERT-based speech synthesis model, Bert TTS, is proposed, which can synthesize high-quality English audio. The model uses pre-trained BERT as the encoder, which mitigates the long-distance information loss typical of RNNs while improving training speed (a minimal encoder sketch is given after this abstract).

(2) For the problem of selecting a representative feature vector for each emotion, a method based on the emotion distance ratio within the emotion dataset is proposed. The method considers both the distribution within each emotion's samples and the other emotion samples around them, and experiments show it outperforms the mean-based feature vector representation (see the second sketch below).

(3) To address the small size of emotional speech datasets, a method is proposed to synthesize emotional speech from a neutral TTS model by fine-tuning it on a small batch of emotional speech data (see the third sketch below).

Experiments show that the Bert TTS model proposed in this thesis roughly doubles training speed while obtaining results similar to the Tacotron2 model. Meanwhile, the proposed neutral speech synthesis model, fine-tuned on a small emotional dataset, can synthesize clear emotional speech and obtains an overall score of 3.77 in the MOS test.
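To make contribution (1) concrete, here is a minimal sketch of using pre-trained BERT as a TTS text encoder. The abstract only states that BERT replaces the RNN encoder; the projection layer, hidden size, and decoder interface below are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertTTSEncoder(nn.Module):
    """Hypothetical sketch: pre-trained BERT as the TTS text encoder.

    Only the use of BERT is stated in the abstract; the projection to a
    decoder hidden size of 256 is an assumption for illustration.
    """
    def __init__(self, hidden_dim=256, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Project BERT's hidden states (768-dim for bert-base) down to
        # the decoder's hidden size.
        self.proj = nn.Linear(self.bert.config.hidden_size, hidden_dim)

    def forward(self, input_ids, attention_mask):
        # All text positions are encoded in parallel (no recurrence),
        # which is what removes the RNN training/inference bottleneck.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world."], return_tensors="pt")
encoder = BertTTSEncoder()
memory = encoder(batch["input_ids"], batch["attention_mask"])
print(memory.shape)  # (1, seq_len, 256), fed to an attention-based decoder
```

Because self-attention connects every token pair directly, long-distance dependencies do not have to survive a chain of recurrent steps, which is the intuition behind the claimed fix for long-distance information loss.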
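For contribution (2), the abstract does not give the exact formula for the emotion distance ratio; the sketch below is one plausible instantiation, picking as representative the sample whose mean distance to its own emotion's samples is smallest relative to its mean distance to other emotions' samples. Both the ratio and the Euclidean metric are assumptions.

```python
import numpy as np

def representative_by_distance_ratio(embs, labels, target):
    """Hypothetical instantiation of the 'emotion distance ratio' idea.

    embs:   (N, D) array of emotion embedding vectors
    labels: (N,) array of emotion labels
    target: the emotion whose representative vector we want
    """
    same = embs[labels == target]
    other = embs[labels != target]
    best, best_ratio = None, np.inf
    for v in same:
        intra = np.linalg.norm(same - v, axis=1).mean()   # compactness
        inter = np.linalg.norm(other - v, axis=1).mean()  # separation
        ratio = intra / inter  # small: central in its cluster, far from others
        if ratio < best_ratio:
            best, best_ratio = v, ratio
    return best

# Toy usage on random data.
rng = np.random.default_rng(0)
embs = rng.normal(size=(60, 16))
labels = np.repeat(np.arange(3), 20)
rep = representative_by_distance_ratio(embs, labels, target=0)
```

Unlike a simple per-emotion mean, this selection also penalizes candidates that sit close to neighboring emotion clusters, which matches the abstract's claim that the method accounts for surrounding emotion samples.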
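For contribution (3), a minimal fine-tuning sketch. The stand-in model, the choice to freeze the encoder, and the learning rate are all assumptions for illustration; the abstract only states that a pre-trained neutral TTS model is fine-tuned on a small emotional dataset.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in for the neutral TTS model (architecture is illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(100, 64)   # token ids -> hidden states
        self.decoder = nn.Linear(64, 80)       # hidden -> 80-dim mel frame

    def forward(self, tokens):
        return self.decoder(self.encoder(tokens))

model = TinyTTS()
# In practice the pre-trained neutral weights would be loaded here, e.g.:
# model.load_state_dict(torch.load("neutral_tts.pt"))  # hypothetical path

# Assumed scheme: freeze the text encoder so the small emotional corpus
# only adapts the decoder, with a small learning rate to limit forgetting.
for p in model.encoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)

# One toy fine-tuning step on fake "emotional" data.
tokens = torch.randint(0, 100, (4, 32))   # batch of phoneme/token ids
target_mel = torch.randn(4, 32, 80)       # emotional mel-spectrogram targets
loss = nn.functional.mse_loss(model(tokens), target_mel)
opt.zero_grad()
loss.backward()
opt.step()
```

The appeal of this transfer setup is that the neutral model already covers pronunciation and prosody basics, so the scarce emotional data only has to teach the emotional coloring rather than speech synthesis from scratch.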
Keywords/Search Tags: Affective computing, Speech synthesis, Recurrent neural network (RNN), Seq2seq, WaveGlow, Attention mechanism