
The Modeling Research For Speech Emotion Towards Expressive Speech Synthesis

Posted on: 2017-02-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Y Gao
Full Text: PDF
GTID: 1108330482479561
Subject: Signal and Information Processing
Abstract/Summary:
Speech is one of the most important tools of human communication: besides conveying literal information, it also expresses emotion through variations of the voice. Current research on emotion and speech focuses mainly on the relationship between specific emotional states and variations of the speech signal. Some indicative clues have been observed concerning the correlation between emotions and changes of acoustic parameters, but because emotional expression is diverse and complex, the numerical distributions of emotional acoustic parameters tend to be scattered. Moreover, in applications of expressive speech synthesis, emotional information is usually derived from manual specification or from the analysis of a specific database; research on predicting emotion from text content and scene factors is still at an early stage. This work focuses on predicting speech emotion through textual analysis, addressing the affective analysis problem for expressive speech synthesis.

Two major problems must be solved: 1) the related theories need refinement, in particular an accurate depiction of emotion and a description of its dynamic evolution; 2) the modeling technology needs a breakthrough that supports multi-scale processing and the characterization of dynamic evolution, since the factors involved in emotion generation are complex and the required parameters may come from multiple levels.

For the first problem, guided by the theory and practice of psychology, reading science, broadcasting science, and phonetics, we combine psycholinguistic and perceptual-phonetic experiments with data analysis to explore the correlation between emotional expression and speech features in the production of acted speech such as reading or broadcasting, and summarize a theoretical mechanism for the generation and development of speech emotion. On this basis, an emotion description model is proposed that explains speech emotion from four perspectives: cognitive appraisal, psychological feeling, physical response, and utterance manner. The four perspectives complement each other and form a distributed representation of speech emotion, with each perspective affecting the others directly or indirectly, in line with the production mechanism of speech emotion. As the final output of the production process, the utterance manner serves as an interface between the emotional description and the acoustic parameters, making the mapping between the two more explicit. Based on this description scheme, a news speech emotion database is constructed and manually annotated; together with the subsequent prediction experiments, it verifies the rationality of the hypothesized generation process and of the description model.

For the second problem, a text-based emotion prediction model is built with deep neural networks: the multi-layered nonlinear mapping structure of a deep network matches the distributed structure of the multi-perspective description model, and it readily models both the dynamic evolution of emotion and the correlations between features at different scales. In particular, to exclude effects beyond the textual content, a topic model is adopted to extract a vector representation of the text in a semantic space. Emotional information is then predicted sequentially at the document, paragraph, and sentence levels. Within each level, a continuous process runs from cognition to psychology, then to physiology and utterance; the utterance is treated as the ultimate target, the other components as sub-targets, and each sub-target's prediction participates as known information in the training of the subsequent steps.
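The within-level chain from cognition to utterance, in which each sub-target's prediction is appended to the known information of the next stage, can be sketched as follows. All dimensions, stage names, and the random linear stages are illustrative stand-ins, not the dissertation's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a topic-model text vector feeds four chained
# predictors, one per perspective of the description model.
TEXT_DIM = 16            # semantic (topic) vector of the sentence
STAGE_DIMS = {           # assumed sizes for each perspective's output
    "cognition": 4,
    "psychology": 4,
    "physiology": 4,
    "utterance": 6,      # final target: the utterance-manner description
}

def linear_stage(in_dim, out_dim):
    """One stage as a random linear map (stand-in for a trained layer)."""
    W = rng.normal(size=(out_dim, in_dim))
    return lambda x: np.tanh(W @ x)

# Build the chain: each stage sees the text vector plus all earlier outputs.
stages = []
in_dim = TEXT_DIM
for name, out_dim in STAGE_DIMS.items():
    stages.append((name, linear_stage(in_dim, out_dim)))
    in_dim += out_dim    # later stages also receive this stage's prediction

def predict(text_vec):
    """Run the cognition -> psychology -> physiology -> utterance chain."""
    known = text_vec
    outputs = {}
    for name, f in stages:
        y = f(known)
        outputs[name] = y
        known = np.concatenate([known, y])   # sub-target becomes known info
    return outputs

out = predict(rng.normal(size=TEXT_DIM))
print({k: v.shape for k, v in out.items()})
```

In a trained model each stage would also receive the higher-level (document and paragraph) predictions as context; here only the within-level chaining is shown.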
Across levels, a top-down hierarchical structure is formed: predictions at higher levels participate as known information in the predictions at lower levels, providing a more global contextual reference. Finally, the effectiveness of the prediction model is validated experimentally: adding the influence of the sequential relationships inside an emotion epoch and the interrelationships among features at different scales improves the recall, precision, and F1 value of utterance-manner prediction by 31.8%, 10.3%, and 22.8%, respectively.

The main innovations of this work are:
(1) A multi-perspective emotion description model, proposed on the basis of analyzing and summarizing the process by which emotion is generated in speech. It describes in detail the variations of the components and their developmental relations during emotion generation, and takes the utterance description as an interface between emotion and speech, intended to guide the subsequent adjustment of acoustic parameters during synthesis.
(2) A text-based emotion computing model built on deep neural networks, which accounts for influences at different scales and interactions between the emotion components, and supports the modeling of dynamic derivative relationships and the processing of multi-scale features.
(3) Prior knowledge is introduced into the deep neural network to make its intermediate structure partially visible. By setting the network structure explicitly, the prior knowledge about the generation of speech emotion is exploited effectively, reducing the required amount of training data and the network scale while also improving performance.
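One common way to make an intermediate layer "visible" is to supervise it directly against an annotated sub-target alongside the final target. The combined-loss idea behind innovation (3) can be sketched as follows; all shapes, the random weights, and the 0.5 loss weighting are assumptions for illustration, not the thesis configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy forward pass in which the hidden layer is made "visible": it is
# supervised directly with an annotated sub-target (e.g. the
# psychological-feeling labels), so prior knowledge about the emotion
# generation process constrains the network's internal representation.
x        = rng.normal(size=8)        # topic-model text features
W1       = rng.normal(size=(4, 8))   # input -> visible intermediate layer
W2       = rng.normal(size=(3, 4))   # intermediate -> utterance output
sub_true = rng.normal(size=4)        # annotated sub-target (prior knowledge)
utt_true = rng.normal(size=3)        # annotated utterance-manner target

hidden = np.tanh(W1 @ x)             # the partially visible layer
output = np.tanh(W2 @ hidden)

main_loss = np.mean((output - utt_true) ** 2)   # utterance prediction error
aux_loss  = np.mean((hidden - sub_true) ** 2)   # intermediate supervision
total     = main_loss + 0.5 * aux_loss          # 0.5: assumed weighting

print(float(total))
```

Because the auxiliary term pins the hidden layer to a meaningful annotation, the network needs less data to discover that structure on its own, which is consistent with the reduced training cost the abstract reports.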
Keywords/Search Tags:Expressive speech synthesis, emotion generation, emotion description, text-based emotion prediction, deep neural network, visible intermediate layers