| With the rapid development of artificial intelligence, speech synthesis, which plays an important role in human-computer interaction, has matured. However, most current speech synthesis systems cannot express prosody and emotion. Both carry important information in speech and strongly affect its meaning: an inaccurate emotion causes semantic ambiguity and poor human-computer communication. A plain text-to-speech model therefore often fails to achieve the desired interaction quality in practice. To address these problems, this paper proposes a framework that extracts prosodic features of speech on top of an end-to-end deep learning model, and a speech emotion synthesis method based on transfer learning. The main contributions are as follows. 1. A speech style transfer method based on deep learning is proposed. The method is trained on emotional speech data from an English female speaker, uses the end-to-end model Tacotron 2 for basic speech synthesis, and uses a variational autoencoder (VAE) to model the emotional prosody information in speech. To improve the VAE's ability to extract prosodic information, a vector-quantized variational autoencoder (VQ-VAE) is used to obtain a discrete representation of emotional prosody. During training, it was also observed that a good attention alignment approaches a diagonal line in the alignment plot. To make this diagonal emerge sooner, that is, to speed up the convergence of the attention alignment and improve the synthesis quality of the overall model, this paper replaces the location-sensitive attention of the baseline Tacotron 2 with forward attention, which constrains the alignment path so that a good alignment appears earlier in training. Experimental results show that the method learns the emotional prosodic features of the reference audio more effectively and transfers them to the text to be synthesized, yielding better synthesized speech. 2. A speech emotion synthesis method based on transfer learning is proposed. Deep-learning-based speech style transfer can control the style of synthesized speech by extracting prosodic features from different reference audio, but it cannot reliably synthesize speech with a specified emotion. Moreover, a deep-learning-based emotional synthesis framework needs a large amount of data, which is often very difficult to collect, and fully training such a model requires considerable computing power and time. To overcome these limitations and the high time cost of speech style transfer, this paper proposes fine-tuning a pretrained end-to-end speech synthesis model into an emotion synthesis model using a small emotional speech dataset. Unlike the recurrent-neural-network-based synthesis model used in the style transfer study, this study is based mainly on a convolutional neural network. Previous experiments have shown that models based on convolutional neural networks train faster than, and sometimes even outperform, speech synthesis based on recurrent neural networks. Experimental results show that the method produces speech with the specified emotion type, and that the synthesized speech achieves high naturalness and emotional expressiveness. |
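
The abstract states that forward attention constrains the alignment path so that a diagonal alignment emerges earlier. As a rough illustration only, the sketch below shows the commonly published forward-attention recursion in plain Python: at each decoder step, the weight at encoder position n combines the previous weights at positions n and n-1, is scaled by the current attention probability, and is then renormalized. The function names `initial_alignment` and `forward_attention_step` are hypothetical, the input `y` stands for the softmax output of whatever base attention mechanism is used, and the exact variant adopted in this paper is not specified here, so this should be read as an assumption rather than the authors' implementation.

```python
def initial_alignment(num_positions):
    """Start with all attention mass on the first encoder position."""
    if num_positions < 1:
        raise ValueError("num_positions must be at least 1")
    return [1.0] + [0.0] * (num_positions - 1)


def forward_attention_step(prev_alpha, y):
    """One step of the forward-attention recursion (illustrative sketch).

    prev_alpha: alignment weights from the previous decoder step.
    y: attention probabilities for the current step, as produced by the
       underlying attention mechanism (assumed to be a softmax output).
    Returns the new, renormalized alignment weights.
    """
    if len(prev_alpha) != len(y):
        raise ValueError("prev_alpha and y must have the same length")

    # Previous weights shifted right by one: the mass that can "move forward"
    # from position n-1 to position n.
    shifted = [0.0] + list(prev_alpha[:-1])

    # Mass may either stay at n or arrive from n-1, weighted by y[n].
    unnormalized = [(stay + move) * prob
                    for stay, move, prob in zip(prev_alpha, shifted, y)]

    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("alignment collapsed to zero mass")
    return [value / total for value in unnormalized]


if __name__ == "__main__":
    alpha = initial_alignment(4)
    uniform = [0.25, 0.25, 0.25, 0.25]  # hypothetical attention probabilities
    for step in range(2):
        alpha = forward_attention_step(alpha, uniform)
        print(step + 1, alpha)
```

Because mass can only stay in place or advance by one position per decoder step, the alignment cannot jump ahead or move backwards, which is why a monotonic, near-diagonal path tends to appear early in training.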