
Research On Text-To-Speech Quality Evaluation Based On Deep Learning And Emotion Features

Posted on: 2020-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: M Tang
GTID: 2428330620960011
Subject: Electronic Science and Technology
Abstract/Summary:
Speech synthesis technology is maturing, and speech synthesis products are now widely used in daily life. Although traditional speech synthesis has largely achieved highly natural speech, there is still no effective research on synthesizing highly expressive speech. Existing speech synthesis systems suffer from incoherent speech, stiff expression and a lack of emotion, which results in a poor user experience. These problems are closely related to how speech synthesis quality is evaluated. Traditional quality evaluation focuses mainly on the intelligibility of the speech content, so it cannot effectively measure incoherence, stiff expression or lack of emotion in synthesized speech.

In this project, we study the influence of rhythm, duration, acoustics and other factors on the quality of synthesized speech, and establish a regression model for quality assessment. Within the deep-learning-based speech quality evaluation system, our research focuses on the quality of synthesized Mandarin speech. We therefore use the Blizzard Challenge 2008-2010 Chinese data as training and test data. We have also built our own test corpus, which fully takes syllable balance and emotional diversity into account, so that the data are sufficient for the model and the corpus coverage is wide enough. The evaluation system mainly adopts an LSTM + LR neural network structure. To evaluate synthetic speech products from multiple aspects, channel characteristics of the speech signal, such as P.563 parameters, are added at the signal transmission level. For emotional expression, we use pitch and energy parameters to evaluate the emotional vividness of synthesized speech.

This paper takes the various factors of speech synthesis into consideration and makes a systematic and comprehensive evaluation from the perspective of user experience. The prosody, duration and acoustic features of the corpus were obtained with feature extraction algorithms. We then analyze the mapping between the subjective evaluation results and the speech features. The test results show that, compared with human subjective evaluation results, the RMS error of the system output is 0.4 and the correlation coefficient is 0.7. The experimental results indicate that the objective evaluation results are in good agreement with the subjective evaluation results.
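
The two figures reported above, RMS error and correlation coefficient, compare the system's objective scores against human subjective scores. The following is a minimal sketch of how such figures could be computed, assuming the correlation is Pearson's r (the abstract does not say which coefficient was used). The function names and the score lists are hypothetical and made up for illustration; they are not data or code from the thesis.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length score lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("score lists must be non-empty and equal in length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_r)


# Made-up scores on a 1-5 scale, for illustration only (not thesis data).
subjective_scores = [3.0, 4.0, 2.0, 5.0]  # hypothetical listener ratings
objective_scores = [3.5, 3.5, 2.5, 4.5]   # hypothetical model outputs

print("RMSE:", rmse(objective_scores, subjective_scores))
print("Pearson r:", pearson(objective_scores, subjective_scores))
```

A lower RMS error means the objective scores sit closer to the subjective ones in absolute terms, while a higher correlation means the two rankings of utterances agree more closely; the two are reported together because a model can do well on one and poorly on the other.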
Keywords/Search Tags:Speech synthesis system, objective evaluation, naturalness, deep neural network, LSTM+LR, emotional characteristics