
Research On Emotional Speech Synthesis Based On Generative Adversarial Networks

Posted on: 2021-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Shao
Full Text: PDF
GTID: 2518306104986509
Subject: Information and Communication Engineering

Abstract/Summary:
The development of deep learning has benefited many industries, and speech synthesis is among its notable successes. End-to-end speech synthesis, led by Tacotron, not only makes speech synthesis systems easier to build, but also makes the synthesized speech more intelligible and natural. Synthesized voices have gradually entered daily life: voice assistants and voice-interaction features of all kinds make life more convenient. However, current speech synthesis technology is still at the stage of producing merely intelligible speech; it cannot yet express emotion and deliver vivid speech the way humans do. This is the key obstacle preventing speech synthesis systems from being used more widely. Because end-to-end speech synthesis is itself a recent development, research on emotion in synthesis is only just beginning, and it has become a hot topic in the field.

Since it was first proposed, the Generative Adversarial Network (GAN) has attracted wide attention and made waves in computer vision, with applications ranging from generating photorealistic fake images to image style transfer. To this day, the GAN remains one of the most active research directions among generative models. In contrast to its popularity in computer vision, however, the GAN has seldom been applied to speech synthesis. Inspired by its success in image style transfer, this paper combines a GAN with Tacotron2 to construct a new emotional speech synthesis system that takes text and prosodic features as input and synthesizes emotional speech.

The system consists mainly of a speech synthesis module and a prosody extraction module. The speech synthesis module is a Tacotron2 model; the prosody extraction module extracts prosodic features from a reference speech signal as additional input to Tacotron2. The prosodic features are screened with traditional machine learning methods to ensure that the selected features correlate strongly with emotion while remaining only weakly collinear with one another. Finally, the model is trained with the idea of the Conditional Generative Adversarial Network: the discriminator is responsible for the emotion constraint on the generated speech, and the generator is responsible for fitting the sound. The result is an emotional speech synthesis system in which the emotion of the output speech can be controlled by modifying the prosodic features of the input.

The model is evaluated on intelligibility and naturalness. Intelligibility was measured by the word error rate of a speech recognition system and by subjective MOS scores; the results show that the proposed model far exceeds Tacotron2 on both measures and is on par with GST-Tacotron2. Naturalness was measured by Mel Cepstral Distortion and F0 Frame Error; the F0 Frame Error of the proposed model is 15% lower than that of GST-Tacotron2, while the Mel Cepstral Distortion is the same, indicating that the proposed model is superior to GST-Tacotron2 in naturalness.
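The feature-screening step described above (keep prosodic features that correlate strongly with emotion but only weakly with one another) can be sketched as a simple greedy filter. The feature names, thresholds, and the filter itself are illustrative assumptions, not the thesis's actual procedure:

```python
import numpy as np

def screen_prosodic_features(X, y, names, min_corr=0.3, max_collin=0.9):
    """Greedy filter: keep features whose absolute Pearson correlation with
    the emotion label is at least min_corr, skipping any candidate whose
    correlation with an already-kept feature exceeds max_collin.

    X: (n_samples, n_features) candidate prosodic features
    y: (n_samples,) emotion labels (numeric encoding)
    """
    # Correlation of each feature with the emotion label.
    corr_y = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    kept = []
    for j in np.argsort(-corr_y):          # strongest candidates first
        if corr_y[j] < min_corr:           # remaining candidates are weaker
            break
        # Reject candidates too collinear with a feature already kept.
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= max_collin
               for k in kept):
            kept.append(j)
    return [names[j] for j in kept]
```

On such a filter, a near-duplicate feature (e.g. an F0 statistic derived from another) is dropped for collinearity even though it correlates well with emotion, while an uninformative feature is dropped for low correlation.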
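The conditional adversarial training idea described above (a discriminator that enforces the emotion constraint while the generator fits the sound) can be sketched, in heavily simplified numpy form, as the two loss terms below. The loss shapes and the adversarial weight are assumptions for illustration; in the actual system the generator is the Tacotron2 model and the discriminator a neural network conditioned on the emotion:

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy between predicted probabilities and a 0/1 target."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

def d_loss(D, real_mel, fake_mel, emo):
    # Discriminator: score real, emotion-consistent spectrograms as 1,
    # generated spectrograms as 0, given the emotion condition.
    return bce(D(real_mel, emo), 1.0) + bce(D(fake_mel, emo), 0.0)

def g_loss(D, fake_mel, target_mel, emo, adv_weight=0.1):
    # Generator: fit the sound (reconstruction term, as in Tacotron2 training)
    # plus an adversarial term that pushes D to accept the generated emotion.
    recon = np.abs(fake_mel - target_mel).mean()
    adv = bce(D(fake_mel, emo), 1.0)
    return recon + adv_weight * adv
```

In each training step the discriminator would be updated on `d_loss` and the generator on `g_loss`, so the emotion constraint and the sound fitting are optimized adversarially.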
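The F0 Frame Error used in the naturalness evaluation counts a frame as erroneous when the voicing decision differs between reference and synthesized speech, or when both tracks are voiced but the pitch deviates by more than 20%. A minimal sketch, assuming unvoiced frames are encoded as 0 in the F0 tracks:

```python
import numpy as np

def f0_frame_error(f0_ref, f0_syn, tol=0.2):
    """F0 Frame Error: fraction of frames with a voicing error (one track
    voiced, the other unvoiced) or a gross pitch error (relative deviation
    above tol on frames both tracks call voiced)."""
    ref_v, syn_v = f0_ref > 0, f0_syn > 0
    voicing_err = ref_v != syn_v
    both_voiced = ref_v & syn_v
    pitch_err = np.zeros_like(voicing_err)
    pitch_err[both_voiced] = (np.abs(f0_syn[both_voiced] - f0_ref[both_voiced])
                              / f0_ref[both_voiced]) > tol
    return float((voicing_err | pitch_err).mean())
```

A lower value means the synthesized pitch contour tracks the reference more faithfully, which is why a 15% reduction against GST-Tacotron2 supports the naturalness claim.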
Keywords/Search Tags: Emotional Speech Synthesis System, Tacotron2, Generative Adversarial Network