
Research On Multi-emotional Speech Synthesis Technology Based On Short-term Specific Human Voice

Posted on: 2022-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: F Wang
Full Text: PDF
GTID: 2518306569994529
Subject: Computer Science and Technology
Abstract/Summary:
Voice is one of the most important ways for people to obtain information in daily life, and using machines to simulate the human voice has a wide range of applications in fields such as smart devices. In speech synthesis, several traditional approaches have been developed, including waveform concatenation (splicing) synthesis, synthesis based on modifying prosodic features, and statistical parametric synthesis based on hidden Markov models. However, these methods are difficult to apply in practice because of several drawbacks: strong dependence on the data set, audible splicing traces in the synthesized speech, and cumbersome synthesis pipelines. In recent years, with the rapid development of deep learning and artificial intelligence, speech synthesis methods based on deep neural networks have achieved impressive results. At the same time, increasing attention has been paid to embedding emotional labels in synthesized speech so that it sounds closer to a natural human voice. In addition, how to use short-term speech signals as a reference for feature extraction, so as to achieve rapid and instant synthesis of the target speech, is also an important problem.

Building on existing speech synthesis technology, this thesis focuses on synthesizing speech signals with target emotions from short-term recordings of a specific speaker. The research on the multi-emotional speech synthesis model based on short-term specific human voice is divided into two parts: training an average voice model with a deep neural network on a large-scale data set, and transfer training of an emotional speech synthesis GAN on an emotional speech data set.

In the training of the average voice model, the original input text is converted into phoneme sequences. This extracts part of the knowledge that would otherwise have to be learned from the textual context and reduces the coupling of the text embedding vectors along the time series, leaving room for improving the model's time cost. In the average voice model training stage, a simple recurrent unit (SRU) is used in place of the traditional recurrent neural network (RNN), which weakens the dependence of model training on sequential time-step relationships. The SRU makes parallel training possible and improves the training-time performance of the model.

In the multi-emotional speech synthesis model based on short-term specific human voices, the idea of transfer learning is adopted and a generative adversarial network (GAN) is introduced into the model. The trained average voice model is transferred as the generator of the GAN, and the discriminator is built on a neural network. In the GAN emotional speech synthesis model, the discriminator encodes the input text and the speech signal separately, extracting a text embedding and an emotional style embedding. In addition, a minibatch discrimination method is used when training the GAN in order to prevent the generator from producing only a narrow range of samples.

The experimental results show that the multi-emotional speech synthesis model for short-term specific human voices obtained in this thesis can effectively construct the emotional voice of the target speaker, and the model performs well in both subjective and objective evaluations.
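To illustrate the parallel-training argument above, the following is a minimal sketch of an SRU-style layer, assuming PyTorch; it is a simplified illustration (the highway path reuses the projected input), not the implementation used in the thesis. The heavy matrix projections are computed for all timesteps at once, and only a cheap element-wise recurrence remains sequential.

```python
import torch
import torch.nn as nn

class SimpleSRULayer(nn.Module):
    """Simplified SRU-style layer: the expensive projection runs over the whole
    sequence in parallel; only a light element-wise loop is sequential."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One fused projection yields the candidate, forget gate, and reset gate.
        self.proj = nn.Linear(input_size, 3 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x):
        # x: (batch, time, input_size)
        u = self.proj(x)                                  # parallel over all timesteps
        x_tilde, f_pre, r_pre = u.chunk(3, dim=-1)
        f = torch.sigmoid(f_pre)                          # forget gate
        r = torch.sigmoid(r_pre)                          # reset / highway gate
        c = torch.zeros(x.size(0), self.hidden_size, device=x.device)
        outputs = []
        for t in range(x.size(1)):                        # cheap element-wise recurrence
            c = f[:, t] * c + (1 - f[:, t]) * x_tilde[:, t]
            h = r[:, t] * torch.tanh(c) + (1 - r[:, t]) * x_tilde[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)                # (batch, time, hidden_size)

# Example: layer = SimpleSRULayer(80, 256); y = layer(torch.randn(4, 120, 80))
```

Because the projection over all timesteps dominates the cost while the remaining loop involves only element-wise operations, such a layer parallelizes far better than a standard RNN or LSTM cell, which is the motivation given above for replacing the RNN.

Likewise, the minibatch discrimination mentioned above can be sketched as follows (again assuming PyTorch, with illustrative layer sizes; not the thesis's exact discriminator). The discriminator is given statistics on how similar each sample is to the rest of the batch, which penalizes a generator that keeps producing near-identical outputs.

```python
class MinibatchDiscrimination(nn.Module):
    """Appends cross-sample similarity statistics to discriminator features,
    discouraging the generator from collapsing to a single kind of sample."""
    def __init__(self, in_features, out_features, kernel_dim):
        super().__init__()
        self.T = nn.Parameter(torch.randn(in_features, out_features * kernel_dim) * 0.1)
        self.out_features = out_features
        self.kernel_dim = kernel_dim

    def forward(self, x):
        # x: (batch, in_features) intermediate discriminator features
        m = (x @ self.T).view(-1, self.out_features, self.kernel_dim)
        diff = m.unsqueeze(0) - m.unsqueeze(1)            # pairwise differences
        l1 = diff.abs().sum(dim=3)                        # (batch, batch, out_features)
        sim = torch.exp(-l1).sum(dim=1) - 1               # exclude self-similarity
        return torch.cat([x, sim], dim=1)                 # augmented features
```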
Keywords/Search Tags:speech synthesis, multi-emotional, short-term specific human voice, generative adversarial network