
Emotional Speech Synthesis Based On Neural Network

Posted on: 2022-08-31    Degree: Master    Type: Thesis
Country: China    Candidate: Z B Dai    Full Text: PDF
GTID: 2518306560955639    Subject: Computer technology
Abstract/Summary:
As the output end of human-machine speech interaction, the quality of speech synthesis directly shapes the user experience. A high-quality, stable speech synthesis system makes the machine more anthropomorphic and the interaction process more natural. Many strong TTS models have been proposed to improve the quality of neutral speech, such as Tacotron2 and WaveNet, but most of them use an RNN or LSTM as encoder and decoder, and this autoregressive structure makes both training and inference slow. In addition, as intelligent speech synthesis systems continue to improve, the demand for more natural speech keeps growing. In recent years, the analysis and synthesis of emotional speech has become a new research hotspot, and more and more researchers are working on synthesizing expressive emotional speech. In this field, however, open-source emotional datasets are scarce, and most data come from different speakers, so the datasets available for training are small, which to some extent limits the effectiveness of emotional speech synthesis models based on deep learning. In response to these problems, the main work of this thesis is as follows:

(1) To address the inefficient training and inference of RNN-based neural speech synthesis models and their loss of long-distance information, an end-to-end BERT-based speech synthesis model, Bert TTS, is proposed, which can synthesize high-quality English audio. The model uses pre-trained BERT as the encoder, which mitigates the long-distance information loss typical of RNNs while improving training speed (a minimal encoder sketch is given after this abstract).

(2) For the problem of selecting a representative feature vector for each emotion, a method based on the emotion distance ratio within the emotion dataset is proposed. The method considers both the distribution within each emotion's samples and the other emotion samples around them, and experiments show it outperforms the mean-based feature vector representation (see the second sketch below).

(3) To address the small size of emotional speech datasets, a method is proposed to synthesize emotional speech from a neutral TTS model by fine-tuning it on a small batch of emotional speech data (see the third sketch below).

Experiments show that the Bert TTS model proposed in this thesis roughly doubles training speed while obtaining results similar to the Tacotron2 model. Meanwhile, the proposed neutral speech synthesis model, fine-tuned on a small emotional dataset, can synthesize clear emotional speech and obtains an overall score of 3.77 in the MOS test.
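To make contribution (1) concrete, here is a minimal sketch of using pre-trained BERT as a TTS text encoder. The abstract only states that BERT replaces the RNN encoder; the projection layer, hidden size, and decoder interface below are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertTTSEncoder(nn.Module):
    """Hypothetical sketch: pre-trained BERT as the TTS text encoder.

    Only the use of BERT is stated in the abstract; the projection to a
    decoder hidden size of 256 is an assumption for illustration.
    """
    def __init__(self, hidden_dim=256, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Project BERT's hidden states (768-dim for bert-base) down to
        # the decoder's hidden size.
        self.proj = nn.Linear(self.bert.config.hidden_size, hidden_dim)

    def forward(self, input_ids, attention_mask):
        # All text positions are encoded in parallel (no recurrence),
        # which is what removes the RNN training/inference bottleneck.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world."], return_tensors="pt")
encoder = BertTTSEncoder()
memory = encoder(batch["input_ids"], batch["attention_mask"])
print(memory.shape)  # (1, seq_len, 256), fed to an attention-based decoder
```

Because self-attention connects every token pair directly, long-distance dependencies do not have to survive a chain of recurrent steps, which is the intuition behind the claimed fix for long-distance information loss.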
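For contribution (2), the abstract does not give the exact formula for the emotion distance ratio; the sketch below is one plausible instantiation, picking as representative the sample whose mean distance to its own emotion's samples is smallest relative to its mean distance to other emotions' samples. Both the ratio and the Euclidean metric are assumptions.

```python
import numpy as np

def representative_by_distance_ratio(embs, labels, target):
    """Hypothetical instantiation of the 'emotion distance ratio' idea.

    embs:   (N, D) array of emotion embedding vectors
    labels: (N,) array of emotion labels
    target: the emotion whose representative vector we want
    """
    same = embs[labels == target]
    other = embs[labels != target]
    best, best_ratio = None, np.inf
    for v in same:
        intra = np.linalg.norm(same - v, axis=1).mean()   # compactness
        inter = np.linalg.norm(other - v, axis=1).mean()  # separation
        ratio = intra / inter  # small: central in its cluster, far from others
        if ratio < best_ratio:
            best, best_ratio = v, ratio
    return best

# Toy usage on random data.
rng = np.random.default_rng(0)
embs = rng.normal(size=(60, 16))
labels = np.repeat(np.arange(3), 20)
rep = representative_by_distance_ratio(embs, labels, target=0)
```

Unlike a simple per-emotion mean, this selection also penalizes candidates that sit close to neighboring emotion clusters, which matches the abstract's claim that the method accounts for surrounding emotion samples.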
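For contribution (3), a minimal fine-tuning sketch. The stand-in model, the choice to freeze the encoder, and the learning rate are all assumptions for illustration; the abstract only states that a pre-trained neutral TTS model is fine-tuned on a small emotional dataset.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in for the neutral TTS model (architecture is illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(100, 64)   # token ids -> hidden states
        self.decoder = nn.Linear(64, 80)       # hidden -> 80-dim mel frame

    def forward(self, tokens):
        return self.decoder(self.encoder(tokens))

model = TinyTTS()
# In practice the pre-trained neutral weights would be loaded here, e.g.:
# model.load_state_dict(torch.load("neutral_tts.pt"))  # hypothetical path

# Assumed scheme: freeze the text encoder so the small emotional corpus
# only adapts the decoder, with a small learning rate to limit forgetting.
for p in model.encoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)

# One toy fine-tuning step on fake "emotional" data.
tokens = torch.randint(0, 100, (4, 32))   # batch of phoneme/token ids
target_mel = torch.randn(4, 32, 80)       # emotional mel-spectrogram targets
loss = nn.functional.mse_loss(model(tokens), target_mel)
opt.zero_grad()
loss.backward()
opt.step()
```

The appeal of this transfer setup is that the neutral model already covers pronunciation and prosody basics, so the scarce emotional data only has to teach the emotional coloring rather than speech synthesis from scratch.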
Keywords/Search Tags: Affective computing, Speech synthesis, Recurrent neural network (RNN), Seq2seq, WaveGlow, Attention mechanism