
Research On Key Technologies Of Speech Emotion Recognition

Posted on: 2014-03-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W J Han
Full Text: PDF
GTID: 1228330422992408
Subject: Artificial Intelligence and information processing
Abstract/Summary:
Speech emotion recognition (SER) is an important branch of artificial intelligence. It is a technology for identifying the emotional states of speakers by processing and analyzing the speech signal. It can be used in a wide range of applications, such as natural human-machine interaction, disease diagnosis and monitoring, fatigue detection, public security and other fields. In recent years, with the development of psychology, physiology, neuroscience and computer technology, SER has made remarkable progress, but because of the complexity of emotion and the lag in updating emotion theory, there is still a large gap between current research and mature applications. Therefore, given the current research situation and demands, this dissertation studies SER at different levels, from feature extraction through upgrades of the emotion description model to recognition modeling. The main research contents include:
(1) The emotional prosodic granularity of different emotional states was quantified, and two SER methods based on the combination of long-term and short-term features were proposed. First, based on the local corpus, a qualitative analysis was presented of how the prosodic features and voice quality features change across different emotions (happiness, anger, sadness and surprise). Second, a quantitative analysis of the association between the duration over which features are extracted and their ability to distinguish emotions was discussed. Moreover, the best duration for feature extraction was suggested as a measure of the emotional prosodic granularity of a given emotion. Then, based on the above analysis as well as the continuity and progressiveness of the human listening process, two models were proposed: a Global Control Elman Network (GCElman) consisting of a short-term feedback mechanism in conjunction with a long-term control mechanism, and an Emotional Prosodic Elman Network (EPElman) that considers the prosodic differences between emotions. Both were verified to combine the short-term and long-term acoustic features effectively, and achieved higher performance than models that process these features individually.
(2) The discrete emotion description model used in traditional SER was updated, and dimensional SER based on the dimensional emotion description model was launched. Moreover, to fill the gap in domestic research on dimensional SER, the first Mandarin dimensional emotional corpus, 'MREC', containing data collected entirely from real-life speech with spontaneous emotion, was built and published, both to provide data support for subsequent research and to supplement existing corpus resources. The recording, labeling, and assessment methods for a real-life dimensional speech emotion corpus were summarized as well.
(3) Active learning (AL) strategy based dimensional SER methods were proposed. Given the large-scale corpora, the difficult emotion scoring process, and the heavy annotation burden in dimensional SER, this thesis proposed guiding dimensional emotion annotation and model learning by means of AL. To this end, three dimensional AL methods were developed to estimate the informativeness of unlabeled samples in advance, namely query-by-regression-committee, closest-to-boundary confidence, and diversity weighted confidence based AL methods. As shown by experimental verification, with these three proposed AL methods, samples with high training value were selected for labeling and model training, and with the same amount of labeling effort, model performance improved significantly. This is the first combination of AL theory and dimensional SER research.
(4) A Kullback-Leibler divergence based calculation of speech emotion prediction loss, and an Order Sensitive Network (OSNet) based dimensional SER approach, were proposed. Recognizing the important role played by emotion trend in determining speakers' intents, views and attitudes, the thesis proposed considering both numerical value approximation and order approximation during dimensional SER. To this end, an improved regression model, named OSNet, was constructed for dimensional SER. Specifically, the key issue of model construction was formalized as the minimization of a predefined loss function consisting of a numerical loss part and an order loss part, and a neural network learning algorithm was then adopted to perform the minimization. In order to define the order loss properly, a probabilistic model was employed to describe the sequence of samples ranked by emotional value, and the Kullback-Leibler divergence was then used to quantify the order loss produced by prediction. This model was verified to surpass the commonly used Support Vector Regression model in preserving the emotion order of samples in the task of dimensional SER. This work provides reliable technical support for human-machine interaction systems in judging users' emotion changes and then making the right interactive decisions.
(5) A Split Vector Quantization (SVQ) based distributed SER model was proposed under the principle of "low-cost client, low-bandwidth data transmission, high-performance emotion recognition". The speech acquisition, feature extraction and compression modules were placed on the client side, while the feature decompression and SER modules were placed on the remote server side, and the SVQ algorithm was used for feature compression. A detailed investigation and analysis of the distributed model's performance was presented in the context of recognizing emotions from real-life speech, including how the number and size of codebooks affected the required data transmission bandwidth and SER performance. As shown by the experimental results, the proposed model achieved considerable SER performance in comparison with the stand-alone model at a compression rate over forty. This study provides effective technical support for the promotion of SER over the Internet.
Overall, this dissertation proposed a number of new and effective solutions to the key technical issues currently faced by the SER field, and laid a good foundation for future SER research.
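The loss described in item (4), a numerical part plus a Kullback-Leibler order part, can be illustrated with a minimal sketch. The abstract does not say which probabilistic model the dissertation uses for the ranked samples, so this sketch swaps in an assumed one: a softmax "top-one" probability over the emotion values, as used in listwise ranking. The function names, the mean-squared-error numerical part and the `order_weight` parameter are all hypothetical and are not taken from OSNet.

```python
import numpy as np

def log_top_one(scores):
    """Log of the softmax 'top-one' probabilities (assumed ranking model)."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    return shifted - np.log(np.exp(shifted).sum())

def order_loss(y_true, y_pred):
    """KL divergence between the rank distributions of labels and predictions."""
    log_p = log_top_one(y_true)
    log_q = log_top_one(y_pred)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

def combined_loss(y_true, y_pred, order_weight=0.5):
    """Weighted sum of a numerical (MSE) part and the order part.

    order_weight is a hypothetical trade-off parameter, not from the source.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerical = float(np.mean((y_true - y_pred) ** 2))
    return (1.0 - order_weight) * numerical + order_weight * order_loss(y_true, y_pred)
```

Under this assumed model, a prediction that shifts every value by the same constant keeps the order part near zero and is penalized only by the numerical part, while a prediction that reverses the ranking is penalized by both parts.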
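The client/server split in item (5) can be sketched as follows, assuming the usual form of split vector quantization: the feature vector is cut into sub-vectors, each sub-vector is replaced by the index of its nearest codeword in its own codebook, and only the indices are transmitted. Training the codebooks with plain k-means, the column ranges in `splits`, the codebook size and the 32-bit baseline in `compression_ratio` are assumptions made for illustration and are not the dissertation's settings.

```python
import numpy as np

def nearest_codeword(data, codebook):
    """Index of the closest codeword (squared Euclidean) for each row."""
    distances = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def train_codebook(data, size, iterations=10, seed=0):
    """Plain k-means (Lloyd's algorithm) codebook for one sub-vector block."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), size=size, replace=False)].copy()
    for _ in range(iterations):
        nearest = nearest_codeword(data, codebook)
        for k in range(size):
            members = data[nearest == k]
            if len(members):  # keep the old codeword if its cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def train_svq(features, splits, size):
    """One codebook per (start, stop) column range in splits."""
    return [train_codebook(features[:, a:b], size) for a, b in splits]

def encode(features, splits, codebooks):
    """Client side: replace each sub-vector by its codeword index."""
    columns = [nearest_codeword(features[:, a:b], cb)
               for (a, b), cb in zip(splits, codebooks)]
    return np.stack(columns, axis=1)

def decode(indices, codebooks):
    """Server side: look the codewords up again and concatenate them."""
    return np.hstack([cb[indices[:, j]] for j, cb in enumerate(codebooks)])

def compression_ratio(dim, splits, size, bits_per_value=32):
    """Raw bits per frame divided by index bits per frame."""
    return dim * bits_per_value / (len(splits) * np.log2(size))
```

In this sketch, more or larger codebooks raise the number of index bits sent per frame and usually lower the reconstruction error, which is the kind of trade-off between bandwidth and recognition performance that item (5) investigates.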
Keywords/Search Tags: speech emotion recognition, distributed speech emotion recognition, dimensional emotion description, emotion recognition modelling, feature combination, active learning, neural network