
Deep Learning Based Speech Emotion Recognition Research

Posted on: 2020-11-10  Degree: Master  Type: Thesis
Country: China  Candidate: P C Li  Full Text: PDF
GTID: 2428330572987253  Subject: Information and Communication Engineering
Abstract/Summary:
Speech emotion recognition is the technology of automatically inferring emotion categories from given speech utterances. With growing demand in areas such as commerce and education, building high-accuracy speech emotion recognition systems has become an attractive research topic in the speech field, and deep learning based methods, especially convolutional neural networks (CNNs), have drawn researchers' attention. Some studies have achieved good performance and demonstrated the potential of CNNs. However, several questions remain open. First, which types of features are most appropriate for CNN-based speech emotion recognition models? Second, how should the network structure be designed so that it can learn emotion-discriminative information effectively? Finally, the scarcity of labeled data greatly holds back the development of speech emotion recognition, so how to exploit auxiliary data to improve recognition accuracy remains to be explored. This thesis conducts research and experiments around these questions.

To explore the influence of features on speech emotion recognition, this thesis first establishes an end-to-end CNN speech emotion recognition model and conducts experiments on multiple feature types. The experimental results show that the spectrogram performs best. Building on this, the thesis further examines the different frequency bands of the spectrogram and finds that the low-frequency band is the most important for speech emotion recognition. The thesis then studies the CNN activations of different emotion types to explore the differences among their high-level representations. These analyses contribute to understanding how different emotion types are distributed over the time-frequency plane.

To further exploit the high-level time-frequency information of the CNN model and generate more effective emotion-discriminative representations, this thesis applies bilinear pooling to the model's high-level representations. Bilinear pooling computes the correlations between every pair of dimensions of the high-level output and thus yields a richer emotion representation. However, because emotional corpora are limited in scale, training a full bilinear pooling model is difficult. The thesis therefore adopts factorized bilinear pooling to reduce the dimensionality of the output representation, which significantly improves recognition accuracy. Building on the theory of bilinear pooling, the thesis further proposes an attention pooling based speech emotion recognition model: by introducing top-down and bottom-up attention maps, emotions are better distinguished and performance is improved.

To exploit additional information, alleviate the data scarcity problem, and improve recognition accuracy, this thesis proposes speech emotion recognition methods that utilize phoneme and speaker information. For phoneme features, a two-branch CNN is used to train speech and phoneme features jointly. For speaker information, a residual adapter model performs domain adaptation from the speaker task to the emotion task: a speaker-labeled corpus is first used to train a deep residual network, and an emotion corpus is then used to train the adapters, so that the auxiliary information in the speaker corpus improves emotion recognition performance. Experiments show that the models utilizing phoneme and speaker information significantly outperform the speech-only model.
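As a concrete illustration of the end-to-end CNN described above, the following PyTorch sketch classifies emotions directly from a spectrogram input. The layer sizes, number of emotion classes, and pooling choices are illustrative assumptions, not the thesis's exact architecture.

import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Minimal sketch: end-to-end CNN over a spectrogram (assumed shapes)."""
    def __init__(self, num_emotions=4):  # num_emotions is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, x):                            # x: (batch, 1, freq, time)
        h = self.features(x)                         # high-level time-frequency map
        return self.classifier(self.pool(h).flatten(1))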
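The factorized bilinear pooling step can be sketched as follows: instead of forming the full outer product of high-level features (which is quadratic in the feature dimension), two low-rank projections are multiplied elementwise and sum-pooled over the rank dimension. The dimensions and the signed-square-root normalization are common conventions, assumed here rather than taken from the thesis.

import torch
import torch.nn as nn

class FactorizedBilinearPooling(nn.Module):
    """Sketch of factorized bilinear pooling (dimensions are assumptions)."""
    def __init__(self, in_dim=64, out_dim=128, rank=4):
        super().__init__()
        self.out_dim, self.rank = out_dim, rank
        # Two low-rank projections replace the (in_dim x in_dim) outer product.
        self.U = nn.Linear(in_dim, out_dim * rank, bias=False)
        self.V = nn.Linear(in_dim, out_dim * rank, bias=False)

    def forward(self, x, y):
        # Elementwise product of projections approximates x^T W_i y per output dim.
        joint = self.U(x) * self.V(y)                            # (batch, out_dim*rank)
        joint = joint.view(-1, self.out_dim, self.rank).sum(-1)  # sum-pool over rank
        # Signed square root and L2 normalization, common after bilinear pooling.
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)
        return nn.functional.normalize(joint, dim=-1)

For a single high-level feature vector h, self-correlations are obtained by calling the module as FactorizedBilinearPooling()(h, h).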
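The attention pooling idea, with a class-specific top-down map weighted by a class-agnostic bottom-up saliency map over the time-frequency plane, might look like the following sketch; channel and class counts are assumptions.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch of attention pooling over a CNN time-frequency map."""
    def __init__(self, channels=64, num_emotions=4):
        super().__init__()
        self.top_down = nn.Conv2d(channels, num_emotions, kernel_size=1)
        self.bottom_up = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, h):                                 # h: (batch, C, freq, time)
        b, _, f, t = h.shape
        logits_map = self.top_down(h)                     # per-class score per position
        att = self.bottom_up(h).view(b, 1, f * t)         # class-agnostic saliency
        att = torch.softmax(att, dim=-1).view(b, 1, f, t) # normalize over positions
        return (logits_map * att).sum(dim=(2, 3))         # pooled logits: (batch, E)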
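Finally, the residual adapter approach for speaker-to-emotion domain adaptation can be sketched as below: a backbone convolution trained on the speaker corpus is frozen, and only a small 1x1 adapter convolution is trained on the emotion corpus. This shows a series-adapter variant; the exact adapter placement is an assumption, since the abstract does not specify it.

import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    """Sketch: residual adapter wrapped around a frozen speaker-trained conv."""
    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad = False               # keep speaker-trained weights fixed
        self.adapter = nn.Conv2d(conv.out_channels, conv.out_channels,
                                 kernel_size=1, bias=False)
        nn.init.zeros_(self.adapter.weight)       # adapter starts as an identity path

    def forward(self, x):
        h = self.conv(x)
        return h + self.adapter(h)                # residual adaptation of features

Zero-initializing the adapter means the network initially behaves exactly like the speaker-trained backbone, and the emotion corpus only has to learn a small residual correction.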
Keywords/Search Tags: Speech emotion recognition, deep learning, time-frequency information utilization, bilinear pooling, phoneme and speaker auxiliary information