Research On Text-To-Speech Quality Evaluation Based On Deep Learning And Emotion Features

Posted on:2020-02-22

Degree:Master

Type:Thesis

Country:China

Candidate:M Tang

GTID:2428330620960011

Subject:Electronic Science and Technology

Abstract/Summary:

At present, speech synthesis technology is becoming increasingly mature, and speech synthesis products are widely used in daily life. Although traditional speech synthesis technology has largely achieved highly natural speech, there is still no effective research on synthesizing highly expressive speech. Existing speech synthesis systems suffer from incoherent speech, stiff expression, and a lack of emotion, which results in a poor user experience. These problems are closely related to how speech synthesis quality is evaluated. Traditional quality evaluation systems focus mainly on the intelligibility of the speech content and cannot effectively measure incoherence, stiff expression, or lack of emotion in synthesized speech.

In this project, we study the influence of rhythm, duration, acoustics, and other factors on the quality of synthesized speech, and establish a regression model for quality assessment. Within the deep-learning-based speech quality evaluation system, our research focuses on the quality of synthesized Mandarin speech. We therefore choose the Blizzard Challenge 2008-2010 Chinese data as the training and test data. We have also built our own test corpus, which fully takes syllable balance and emotional diversity into account, so that the data are sufficient for the model and the corpus coverage is wide enough. The speech quality evaluation system mainly adopts an LSTM + LR neural network structure. To evaluate the quality of synthetic speech products from multiple aspects, channel characteristics of the speech signal, such as P.563 parameters, are added at the signal transmission level. For emotional expression, we use pitch and energy parameters to evaluate the emotional vividness of synthesized speech.

This paper takes various factors of speech synthesis into consideration and makes a systematic and comprehensive evaluation from the perspective of user experience. The prosody, duration, and acoustic features of the corpus were obtained with feature extraction algorithms. We then analyze the mapping between the subjective evaluation results and the speech features. The test results show that, compared with the human subjective evaluation results, the RMS error of the system output is 0.4 and the correlation coefficient is 0.7. The experimental results indicate that the objective evaluation results are in good agreement with the subjective evaluation results.
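
The two agreement figures reported in the abstract, RMS error and correlation coefficient, can be illustrated with a minimal sketch. It assumes the correlation coefficient is Pearson's r and that both score lists are on the same rating scale; the abstract specifies neither. The function names and the sample scores are hypothetical and are not taken from the thesis.

```python
import math


def rms_error(predicted, reference):
    """Root-mean-square error between objective and subjective scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient (assumed here; the abstract does not name the type)."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Hypothetical scores for illustration only, not data from the thesis.
    model_scores = [3.1, 4.0, 2.5, 3.8, 4.4]
    listener_scores = [3.4, 3.7, 2.9, 4.1, 4.0]
    print("RMS error:", rms_error(model_scores, listener_scores))
    print("Correlation:", pearson_correlation(model_scores, listener_scores))
```

A lower RMS error means the objective scores sit closer to the listener ratings in absolute terms, while a higher correlation means the two rank utterances more similarly. The two measures are complementary, which is presumably why the abstract reports both.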

Keywords/Search Tags:

Speech synthesis system, objective evaluation, naturalness, deep neural network, LSTM+LR, emotional characteristics

Related items

1	Research On Emotional Speech Synthesis Based On Deep Neural Network
2	Research And Application Of Speech Synthesis Method Integrating Emotional Expressiveness
3	Research Of Improving Naturalness In Speech Synthesis
4	Emotional Speech Synthesis Based On Neural Network
5	Research On Design Of Objective Function For Deep Neural Network Based Speech Enhancement
6	Emotional Speech Synthesis Based On The Curve Regression
7	Mandarin's Synthesis By HTS And Research On Its Naturalness
8	Speech Synthesis Oriented Deep Learning Research And Application
9	Research On Neural Network Based Statistical Parametric Speech Synthesis
10	Speech Enhancement Based On Deep Neural Network And Recurrent Neural Network