
Emotional Speech Synthesis Based On Transfer Learning And Self-learning Emotional Representation

Posted on: 2020-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Zhang
Full Text: PDF
GTID: 2428330572473685
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer science and artificial intelligence, speech synthesis, as a core technology of human-computer interaction, has achieved good results. However, speech synthesis has mainly targeted neutral speech, and emotional speech synthesis still needs improvement. Emotion is an important channel of information that can greatly change the meaning conveyed by speech; when emotional information is absent, expression becomes ambiguous and human-computer communication suffers. This paper analyzes emotional representation in emotional speech synthesis, proposes a self-learning emotional representation method, and proposes an emotional speech synthesis method based on that representation. The main research contents are as follows:

1. To address the limited descriptive power of existing emotional representations, the inconsistency between annotators when labeling emotional speech, and the excessive cost of annotation, a self-learning emotional representation method is proposed that uses an autoencoding neural network to model the emotion in speech. Adversarial training is used to ensure that the learned representation is speaker-independent. Experimental results show that the self-learning emotional representation performs well without manual annotation, solving the problems of annotation cost and inter-annotator disagreement.

2. An emotional speech synthesis method based on transfer learning and self-learning emotional representation is proposed. The method transfers a speaker discriminant model from text-independent speaker verification to extract the speaker's characteristics for emotional speech synthesis. The speaker characteristics, the self-learning emotional representation, and the text are then fed into an end-to-end emotional speech synthesizer to produce a mel-spectrogram, which is finally converted to emotional speech by a WaveNet vocoder. The method requires neither emotional annotation nor speaker labels during training, making it more flexible than other emotional speech synthesis methods. Experimental results show that the method can synthesize speech with high naturalness and strong emotional expressiveness from only a small amount of reference speech from the target speaker.
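The data flow of the synthesis pipeline can be sketched as below. This is a minimal illustration only: all module names, dimensions, and the toy linear layers are assumptions for demonstration, not the thesis's actual architecture, and the adversarial speaker classifier and WaveNet vocoder are noted in comments rather than implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Toy stand-in for a trained neural module (fixed random weights)."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: x @ W

# 1. Self-learning emotion encoder: an autoencoder-style network compresses
#    the reference mel-spectrogram into a low-dimensional emotion vector.
#    During training, an adversarial speaker classifier would push speaker
#    identity OUT of this vector, making it speaker-independent.
emotion_encoder = linear(80, 8)       # 80 mel bins -> 8-dim emotion code

# 2. Speaker encoder transferred from text-independent speaker verification:
#    maps reference audio to a speaker embedding.
speaker_encoder = linear(80, 16)      # 80 mel bins -> 16-dim speaker code

# 3. End-to-end synthesizer: consumes per-frame text features plus the two
#    conditioning vectors and emits a mel-spectrogram.
synthesizer = linear(32 + 8 + 16, 80)

def synthesize(text_feats, ref_mel):
    """text_feats: (T, 32) text features; ref_mel: (T_ref, 80) reference."""
    summary = ref_mel.mean(axis=0)                   # crude utterance pooling
    emo = emotion_encoder(summary)                   # (8,)
    spk = speaker_encoder(summary)                   # (16,)
    T = text_feats.shape[0]
    cond = np.concatenate(
        [text_feats, np.tile(emo, (T, 1)), np.tile(spk, (T, 1))], axis=1)
    mel = synthesizer(cond)                          # (T, 80) mel-spectrogram
    # 4. A neural vocoder (WaveNet in the thesis) would convert the
    #    mel-spectrogram to a waveform; omitted here.
    return mel

mel = synthesize(rng.standard_normal((50, 32)), rng.standard_normal((120, 80)))
print(mel.shape)  # (50, 80)
```

Because the emotion and speaker vectors come from reference audio rather than labels, the same flow works with unannotated training data, which is the flexibility the method claims.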
Keywords/Search Tags:emotional speech synthesis, emotion modeling, self-learning emotion representation, adversarial training, transfer learning