
Analysis Of Effective Fused Features And Model Evaluation For Speech Emotion Recognition

Posted on: 2019-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: L J Zhang
Full Text: PDF
GTID: 2428330626452093
Subject: Computer Science and Technology
Abstract/Summary:
The field of man-machine communication has witnessed tremendous improvement in recent years, especially with intelligent voice assistants such as Siri, Cortana, and Google Assistant. Nevertheless, we still have difficulty communicating with machines naturally. Speech emotion recognition (SER) is therefore becoming one of the latest trends in man-machine communication: speech emotion is believed to be particularly useful in human-computer interfaces, because emotion carries essential semantics and helps machines better understand human speech. However, SER is technically challenging because it is not clear which speech features are salient for characterizing different emotions. The purpose of this study is to find features that significantly improve the accuracy of speech emotion recognition.

Conventional SER has produced a wealth of results using perceptual features, that is, acoustic features of speech. In recent years the convolutional neural network (CNN) has shown great power in mining deep information from raw spectrograms, but the prior knowledge embodied in perceptual features has not been exploited as thoroughly as in traditional methods. To address this problem, we propose a novel feature fusion strategy that uses comprehensive spectrographic information and the prior knowledge simultaneously. First, the frame-level low-level descriptors (LLDs) of perceptual features are arranged as time-sequence LLDs so that a CNN can learn from them effectively. Next, the spectrogram and the time-sequence LLDs are fused into compositional spectrographic features (CSF). Since CSF lack global and dynamic information, statistical features are added to generate rich-compositional spectrographic features (RSF). The fused features, in the form of two-dimensional (2-D) images, are then fed to a CNN to extract hierarchical features, and finally a bidirectional long short-term memory (BLSTM) network exploits the context information and recognizes the emotion; sketches of the fusion and of the CNN-BLSTM network follow this abstract. Compared with the raw spectrogram, our results show that CSF and RSF improve the unweighted accuracy by a relative error reduction of 32.04% and 36.91%, respectively.

Gender information has been widely used to improve the performance of SER, owing to the different expressive styles of men and women. However, conventional methods cannot adequately exploit gender information when they represent gender characteristics with a fixed integer or a one-hot encoding. To emphasize gender factors in SER, we propose two further features for our framework: the distributed-gender feature and the gender-driven feature. The distributed-gender feature is constructed to represent the gender distribution as well as individual differences, while the gender-driven feature is extracted from the acoustic signal by a deep neural network (DNN); a sketch of both follows this abstract. Each of these features is augmented into the original spectrogram to serve as input to the decision-making network, a hybrid of CNN and BLSTM. Compared with the spectrogram alone, adding the distributed-gender feature or the gender-driven feature to the gender-aware CNN-BLSTM improves the unweighted accuracy by a relative error reduction of 14.04% and 45.74%, respectively.
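As a concrete illustration of the fusion step, the minimal NumPy sketch below stacks the time-aligned LLD rows under the spectrogram to form the CSF image, and tiles utterance-level statistics into constant rows to obtain the RSF image. The row-stacking layout, function names, and shapes are assumptions for illustration; the abstract does not specify the exact arrangement.

import numpy as np

def build_csf(spectrogram, llds):
    # spectrogram: (freq_bins, frames); llds: (n_llds, frames), time-aligned
    assert spectrogram.shape[1] == llds.shape[1], "frame counts must match"
    return np.vstack([spectrogram, llds])   # (freq_bins + n_llds, frames)

def build_rsf(spectrogram, llds, stats):
    # stats: (n_stats,) utterance-level statistics supplying the global and
    # dynamic information CSF lacks; each statistic becomes a constant row
    # repeated over the time axis (an assumed layout)
    csf = build_csf(spectrogram, llds)
    stat_rows = np.tile(stats[:, None], (1, csf.shape[1]))
    return np.vstack([csf, stat_rows])

# e.g. a 128-bin spectrogram, 16 LLDs and 10 statistics over 300 frames
rsf = build_rsf(np.random.randn(128, 300), np.random.randn(16, 300),
                np.random.randn(10))
print(rsf.shape)   # (154, 300)

Because every row shares the same time axis, the fused result remains a single 2-D image that a CNN can consume directly.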
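The decision-making network pairs a CNN, which extracts hierarchical features from the 2-D fused image, with a BLSTM, which models temporal context. The PyTorch sketch below is one plausible instantiation; the layer counts, channel widths, and the choice to classify from the final time step are assumptions, not the thesis configuration.

import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    def __init__(self, n_rows, n_emotions, hidden=128):
        super().__init__()
        # CNN front end: two conv/pool stages extract hierarchical features
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # two 2x2 poolings quarter the row (feature) dimension
        self.blstm = nn.LSTM(input_size=64 * (n_rows // 4),
                             hidden_size=hidden,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):                       # x: (batch, 1, n_rows, frames)
        h = self.cnn(x)                         # (batch, 64, rows//4, frames//4)
        h = h.permute(0, 3, 1, 2).flatten(2)    # (batch, time, 64 * rows//4)
        out, _ = self.blstm(h)                  # bidirectional context over time
        return self.fc(out[:, -1])              # emotion logits

model = CNNBLSTM(n_rows=154, n_emotions=4)
logits = model(torch.randn(8, 1, 154, 300))    # batch of 8 RSF images
print(logits.shape)                            # torch.Size([8, 4])

Treating each pooled frame column as one BLSTM time step lets the recurrent layer see the whole utterance in both directions, which is what "utilizing the context information" refers to.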
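For the gender-aware variant, one possible reading of the two proposed features is sketched below: the distributed-gender feature is drawn from a gender-conditioned distribution so that it carries individual variation rather than a fixed integer, and the gender-driven feature is the per-frame output of a small DNN applied to acoustic features. Both the distributional form and the DNN architecture are illustrative assumptions.

import torch
import torch.nn as nn

def distributed_gender_feature(is_male, frames, sigma=0.1):
    # Draw a per-utterance value from a gender-conditioned Gaussian so the
    # feature encodes the gender class plus individual variation (assumed
    # form); the value is repeated as a constant row over time.
    mean = 1.0 if is_male else -1.0
    return torch.normal(mean, sigma, size=(1, 1)).expand(1, frames)

class GenderDNN(nn.Module):
    # Hypothetical DNN whose per-frame gender posterior serves as the
    # gender-driven feature; layer sizes are illustrative assumptions.
    def __init__(self, n_in, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
    def forward(self, frames):                  # frames: (n_frames, n_in)
        return self.net(frames)                 # (n_frames, 1), P(male) per frame

# Augment either feature into the spectrogram as extra rows (assumed layout)
spec = torch.randn(128, 300)
g_dist = distributed_gender_feature(True, 300)                 # (1, 300)
g_driven = GenderDNN(n_in=16)(torch.randn(300, 16)).T          # (1, 300)
fused = torch.cat([spec, g_dist], dim=0)                       # (129, 300)

The augmented image then feeds the same CNN-BLSTM network as above, now with gender information available at every time step.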
Keywords/Search Tags: Speech emotion recognition, Spectrogram, Perceptual features, Gender information, Convolutional neural network, Bi-directional long short-term memory, Deep neural network