Research On Affective Speech Synthesis

Posted on: 2014-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Lu
Full Text: PDF
GTID: 2268330422460119
Subject: Circuits and Systems

Abstract/Summary:
With the wide use of human-computer interaction systems in recent years, speech synthesis technology has attracted broad attention. Although speech synthesis has achieved good results in clarity, intelligibility, and naturalness, current human-computer interaction systems are mainly based on neutral voices and lack emotional expression. Human voice communication, however, carries not only the basic verbal content but also a large amount of rich emotional information. Emotional speech synthesis has therefore become an active topic of international research.

This thesis introduces the three-dimensional PAD (Pleasure-Arousal-Dominance) emotional model, establishes an emotional corpus covering 11 kinds of emotion, and annotates the emotional speech with PAD values. On this basis, a five-degree tone model is used to build a fundamental-frequency (F0) model of emotional speech, and a GRNN (Generalized Regression Neural Network) performs the prosody conversion to emotional speech. The thesis further applies speaker adaptive training (SAT) to achieve statistical parametric synthesis of emotional speech. The main work and contributions of the thesis are as follows:

First, we establish an emotional speech corpus. The corpus records 11 kinds of typical emotions from one female speaker: neutral, relaxed, surprised, tender, joyful, angry, anxious, disgusted, contemptuous, fearful, and sad. Adopting the three-dimensional PAD emotional model, we annotate the speech corpus with PAD values and annotate the text corpus with prosodic structure.

Second, we propose an emotional prosody conversion method based on the PAD three-dimensional emotion model. A five-degree tone model is used to build the F0 envelope model of emotional speech, and a GRNN maps emotion parameters to prosody (a minimal sketch of such a mapping is given after this abstract). Experimental results show that the maximum RMSE of the F0 envelope fitted by the five-degree tone model is less than 6.9 Hz, which meets the requirements of F0 curve modeling. Within the 95% confidence interval, the average EMOS score of emotional speech converted with the GRNN model is 3.6, showing that the converted speech can express the intended emotional information.

Finally, we propose a statistical parametric synthesis method for emotional speech based on speaker adaptive training (SAT). The thesis designs a context-dependent text annotation format and creates a question set for emotional speech. By mixing a multi-speaker Mandarin corpus with one speaker's emotional speech corpus, an average voice model is obtained through speaker adaptive training. Then, through speaker adaptive transformation using the speaker's emotional training speech, a speaker-dependent (SD) emotional speech model is derived from the average voice model, from which emotional speech can be synthesized. Experimental results show that the average EMOS score of speech synthesized with the proposed method is 2.7, which is higher than the EMOS score of a model trained only on the emotional speech.
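To make the prosody-conversion step concrete, the following is a minimal, illustrative sketch of the regression a GRNN performs: a Gaussian-kernel-weighted average of training targets (the Nadaraya-Watson form). The training pairs, the choice of prosody targets (mean F0 and a speaking-rate factor), and the smoothing factor sigma are assumptions for illustration, not values from the thesis.

    # Minimal GRNN sketch: predict prosody parameters from a PAD
    # emotion vector. All numbers below are illustrative, not from
    # the thesis.
    import numpy as np

    def grnn_predict(X_train, Y_train, x, sigma=0.3):
        """Nadaraya-Watson form of a GRNN: a kernel-weighted average
        of training targets, weighted by Gaussian distance to x."""
        d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances
        w = np.exp(-d2 / (2.0 * sigma ** 2))      # pattern-layer weights
        return (w[:, None] * Y_train).sum(axis=0) / w.sum()

    # PAD vectors (pleasure, arousal, dominance) -> prosody targets
    # (mean F0 in Hz, speaking-rate factor); toy values.
    X_train = np.array([[ 0.0,  0.0,  0.0],   # neutral
                        [ 0.6,  0.7,  0.4],   # joy
                        [-0.5,  0.6,  0.3],   # anger
                        [-0.6, -0.4, -0.5]])  # sadness
    Y_train = np.array([[200.0, 1.00],
                        [245.0, 1.15],
                        [230.0, 1.20],
                        [180.0, 0.85]])

    print(grnn_predict(X_train, Y_train, np.array([0.5, 0.5, 0.2])))

Because the GRNN is a one-pass kernel regressor, the only quantity to tune is the smoothing factor sigma, which trades off between reproducing the training prosody exactly and interpolating smoothly between PAD points.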
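The adaptation step can likewise be sketched. The thesis performs speaker adaptive training within an HMM framework; the snippet below only illustrates the underlying idea of mapping an average-voice Gaussian mean toward a target speaker with a linear (MLLR-style) transform mu_adapted = A * mu + b. The dimension, A, and b are made-up illustrative values; in practice the transform would be estimated from the speaker's emotional adaptation data.

    # Hedged sketch of the adaptation idea behind SAT-style synthesis:
    # shift an average-voice Gaussian mean toward the target (emotional)
    # speaker with a linear transform. Values are illustrative only.
    import numpy as np

    dim = 3                                  # toy feature dimension
    mu_avg = np.array([1.0, -0.5, 0.2])      # average-voice model mean
    A = np.eye(dim) * 1.05                   # illustrative transform matrix
    b = np.array([0.1, 0.0, -0.05])          # illustrative bias

    mu_sd = A @ mu_avg + b                   # speaker-dependent adapted mean
    print(mu_sd)

In the SAT framework the same kind of transform is also applied during training to normalize away speaker differences, which is what makes the average voice model a good starting point for adaptation.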
Keywords/Search Tags: Affective Speech, PAD, Five Degree Tone Model, Prosody Conversion, Hidden Markov Models, Speaker Adaptive Training