Research On Speech Emotion Recognition Based On Emotional Feature Enhancement

Posted on: 2018-12-06  Degree: Doctor  Type: Dissertation
Country: China  Candidate: X Z Xu  Full Text: PDF
GTID: 1318330542451435  Subject: Information and Communication Engineering
Abstract/Summary:
As a frequently investigated topic, Speech Emotion Recognition (SER) aims to estimate emotional states from spoken utterances. Typical applications of SER lie in Human-Computer Interaction (HCI), autism or depression detection, and the detection of negative emotions in extreme conditions. However, it is difficult to obtain emotion-specific features directly, since the original paralinguistic features may contain factors that are more effective for other tasks than for emotion recognition. Inspired by this, SER based on emotional feature enhancement is the key topic of this dissertation. Emotional feature enhancement methods construct training models by jointly considering the information of the training samples (i.e., features and labels); the emotional characteristics captured by the models are then preserved and enhanced when new features are reconstructed for the samples, making those features better suited to the target task and improving recognition or estimation. Unfortunately, unlike other popular applications of feature enhancement (e.g., face recognition, speaker identification, and automatic speech recognition), emotion recognition in speech is harder to solve, since many factors disturb the target task; as a result, existing work on this topic still cannot recognise emotional states effectively enough for practical application requirements. As a concrete line of feature enhancement, this dissertation investigates subspace learning methods with various structures in order to enhance emotional feature representations in speech. Experimental evaluation shows that the proposed methods are effective in solving SER. The main contributions of the dissertation are as follows:

(1) Multiscale kernels, structured as multiple kernel subspace learning, are employed to recognise emotion in speech effectively. With a Fisher discriminant embedding graph, multiscale Gaussian kernels are used to construct an optimal linear combination of Gram matrices for Multiple Kernel Learning (MKL). To evaluate the proposed Multiscale-Kernel Fisher Discriminant Analysis (MS-KFDA) method, comprehensive experiments using different public feature sets from the open-source toolbox openSMILE on various corpora show that the proposed method achieves better performance than conventional linear dimensionality reduction methods and single-kernel methods.

(2) Further, we propose a novel method that learns multiscale kernels with locally penalised discriminant analysis, namely Multiscale-Kernel Locally Penalised Discriminant Analysis (MS-KLPDA), and apply it to recognising emotions in speech as an exemplary use case. Specifically, we employ locally penalised discriminant analysis to control the weights of marginal sample pairs while the method learns kernels at multiple scales. Evaluated in a series of experiments on emotional speech corpora, the proposed MS-KLPDA outperforms the earlier MS-KFDA and some conventional methods in solving SER.

(3) We propose a two-dimensional framework for multiple kernel subspace learning. The framework provides more linear combinations than standard MKL by removing the non-negativity constraints, which preserves more information during the learning procedure, and it leverages both MKL and two-dimensional subspace learning by combining them into a unified structure. To apply the framework to SER, we also propose an algorithm, namely Generalised Multiple Kernel Discriminant Analysis (GMKDA), which employs discriminant embedding graphs in this framework; GMKDA takes advantage of the additional mapping directions that the framework provides for multiple kernels. To evaluate the proposed algorithm, a wide range of experiments is carried out on several key emotional corpora. The results demonstrate that the proposed methods achieve better performance than some conventional and subspace learning methods in dealing with SER.
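To make the kernel construction shared by contributions (1)-(3) concrete, the following is a minimal sketch of building multiscale Gaussian Gram matrices and combining them linearly, as in MKL. The scale grid, the uniform weights, and the function names are illustrative assumptions; in MS-KFDA the combination weights would be optimised against the Fisher discriminant objective rather than supplied by hand.

```python
import numpy as np

def multiscale_gram_matrices(X, sigmas):
    """Build one Gaussian Gram matrix per scale sigma.

    X: (n_samples, n_features) acoustic feature matrix,
    e.g. an openSMILE functional feature set.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    d2 = np.maximum(d2, 0.0)                         # guard against float round-off
    return [np.exp(-d2 / (2.0 * s ** 2)) for s in sigmas]

def combined_kernel(grams, betas):
    """Linear combination of Gram matrices, the core object in MKL.

    In MS-KFDA the weights betas would be learned; here they are
    simply supplied by the caller (assumption for this sketch).
    """
    return sum(b * K for b, K in zip(betas, grams))

# toy usage: 100 utterances, 384-dim features, 4 kernel scales
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 384))
sigmas = [0.5, 1.0, 2.0, 4.0]                  # assumed multiscale bandwidth grid
grams = multiscale_gram_matrices(X, sigmas)
K = combined_kernel(grams, np.full(4, 0.25))   # uniform weights as placeholder
```

The combined Gram matrix K would then be handed to a kernel discriminant solver built on the Fisher embedding graph.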
(4) Combining the Extreme Learning Machine (ELM) and subspace learning, we propose a novel framework that unifies Spectral Regression (SR) based Graph Embedding (GE) with the ELM. The framework contains three stages, namely data mapping, graph decomposition, and regression. At the data mapping stage, various mappings can be employed to provide different views of the samples. At the graph decomposition stage, designed embedding graphs are decomposed into virtual coordinates, offering a way to depict the structure of the data better. Finally, at the regression stage, dimension-reduced mappings are obtained by connecting the virtual coordinates with the data mapping. Following this framework, several novel dimensionality reduction algorithms arise naturally from these stages and are applied to computational paralinguistics (i.e., SER). Related state-of-the-art methods are then compared against the proposed algorithms in the evaluation procedure.

(5) We propose a bimodal emotion recognition method based on local facial expressions. The method jointly exploits local facial expressions in the video signal for emotional feature enhancement, in addition to the paralinguistic features in the audio signal, and uses early feature fusion to classify emotional states better. Experimental results show that the proposed bimodal approach outperforms the single-modal methods.
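Since contribution (4) spells out its three stages explicitly, a minimal sketch may help to fix ideas. The LDA-style affinity graph, the identity data mapping, and the ridge regression solver below are assumptions chosen for brevity, not the dissertation's exact formulation; an ELM random feature mapping could replace the identity mapping at stage one.

```python
import numpy as np

def spectral_regression_embedding(X, labels, dim, alpha=0.1):
    """Three-stage sketch of SR-based graph embedding:
    (1) data mapping, (2) graph decomposition into virtual
    coordinates, (3) regression linking the two.
    """
    n = X.shape[0]
    Phi = X  # stage 1: identity mapping (an ELM mapping is one alternative)

    # stage 2: supervised affinity graph connecting same-class samples,
    # then eigen-decomposition yields the virtual coordinates
    W = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        W[np.ix_(idx, idx)] = 1.0 / len(idx)
    Dh = np.diag(1.0 / np.sqrt(W.sum(axis=1)))     # D^(-1/2) for the symmetric form
    lam, V = np.linalg.eigh(Dh @ W @ Dh)
    order = np.argsort(lam)[::-1][:dim]            # keep the leading eigenvectors
    coords = (Dh @ V)[:, order]                    # virtual coordinates

    # stage 3: ridge regression from the mapped data to the virtual
    # coordinates gives the dimension-reduced projection
    A = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(Phi.shape[1]),
                        Phi.T @ coords)
    return A

# toy usage: 60 utterances, 20-dim features, three emotion classes
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 20))
labels = np.repeat(np.arange(3), 20)
A = spectral_regression_embedding(X, labels, dim=2)
Z = X @ A   # dimension-reduced features for a downstream classifier
```

Swapping the embedding graph at stage two is what yields the family of dimensionality reduction algorithms the framework describes.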
Keywords/Search Tags:speech emotion recognition, paralinguistics, emotional feature enhancement, subspace learning, graph embedding, discriminant analysis, multiple kernel learning, spectral regression, extreme learning machines, bimodality