
Deep Learning Based Speech Emotion Recognition Research

Posted on: 2020-11-10  Degree: Master  Type: Thesis
Country: China  Candidate: P C Li  Full Text: PDF
GTID: 2428330572987253  Subject: Information and Communication Engineering
Abstract/Summary:
Speech emotion recognition is the technology of automatically inferring emotion categories from given speech utterances. With growing demand in areas such as commerce and education, building high-accuracy speech emotion recognition systems has become an attractive research topic in the speech field, and deep learning based methods, especially convolutional neural networks (CNNs), have drawn researchers' attention. Some studies have achieved good performance and demonstrated the potential of CNNs. However, several questions remain open. First, which types of features are most appropriate for CNN-based speech emotion recognition models? Second, how should the network structure be designed so that it can learn emotion-discriminative information effectively? Finally, the scarcity of labeled data greatly holds back the development of speech emotion recognition, so how to exploit auxiliary data to improve recognition accuracy remains to be explored. This thesis conducts research and experiments around these questions.

To explore the influence of features on speech emotion recognition, this thesis first establishes an end-to-end CNN speech emotion recognition model and conducts experiments on multiple feature types. The experimental results show that the spectrogram performs best. Building on this, the thesis further examines the different frequency bands of the spectrogram and finds that the low-frequency band is the most important for speech emotion recognition. The thesis then studies the CNN activations of different emotion types to explore the differences among their high-level representations. These analyses contribute to understanding how different emotion types are distributed over the time-frequency plane.

To further exploit the high-level time-frequency information of the CNN model and generate more effective emotion-discriminative representations, this thesis applies bilinear pooling to the model's high-level representations. Bilinear pooling computes the correlations between every pair of dimensions of the high-level output and thus yields a richer emotion representation. However, because emotional corpora are limited in scale, training a full bilinear pooling model is difficult. The thesis therefore adopts factorized bilinear pooling to reduce the dimensionality of the output representation, which significantly improves recognition accuracy. Building on the theory of bilinear pooling, the thesis further proposes an attention pooling based speech emotion recognition model: by introducing top-down and bottom-up attention maps, emotions are better distinguished and performance is improved.

To exploit additional information, alleviate the data scarcity problem, and improve recognition accuracy, this thesis proposes speech emotion recognition methods that utilize phoneme and speaker information. For phoneme features, a two-branch CNN is used to train speech and phoneme features jointly. For speaker information, a residual adapter model performs domain adaptation from the speaker task to the emotion task: a speaker-labeled corpus is first used to train a deep residual network, and an emotion corpus is then used to train the adapters, so that the auxiliary information in the speaker corpus improves emotion recognition performance. Experiments show that the models utilizing phoneme and speaker information significantly outperform the speech-only model.
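As a concrete illustration of the end-to-end CNN described above, the following PyTorch sketch classifies emotions directly from a spectrogram input. The layer sizes, number of emotion classes, and pooling choices are illustrative assumptions, not the thesis's exact architecture.

import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Minimal sketch: end-to-end CNN over a spectrogram (assumed shapes)."""
    def __init__(self, num_emotions=4):  # num_emotions is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, x):                            # x: (batch, 1, freq, time)
        h = self.features(x)                         # high-level time-frequency map
        return self.classifier(self.pool(h).flatten(1))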
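The factorized bilinear pooling step can be sketched as follows: instead of forming the full outer product of high-level features (which is quadratic in the feature dimension), two low-rank projections are multiplied elementwise and sum-pooled over the rank dimension. The dimensions and the signed-square-root normalization are common conventions, assumed here rather than taken from the thesis.

import torch
import torch.nn as nn

class FactorizedBilinearPooling(nn.Module):
    """Sketch of factorized bilinear pooling (dimensions are assumptions)."""
    def __init__(self, in_dim=64, out_dim=128, rank=4):
        super().__init__()
        self.out_dim, self.rank = out_dim, rank
        # Two low-rank projections replace the (in_dim x in_dim) outer product.
        self.U = nn.Linear(in_dim, out_dim * rank, bias=False)
        self.V = nn.Linear(in_dim, out_dim * rank, bias=False)

    def forward(self, x, y):
        # Elementwise product of projections approximates x^T W_i y per output dim.
        joint = self.U(x) * self.V(y)                            # (batch, out_dim*rank)
        joint = joint.view(-1, self.out_dim, self.rank).sum(-1)  # sum-pool over rank
        # Signed square root and L2 normalization, common after bilinear pooling.
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)
        return nn.functional.normalize(joint, dim=-1)

For a single high-level feature vector h, self-correlations are obtained by calling the module as FactorizedBilinearPooling()(h, h).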
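The attention pooling idea, with a class-specific top-down map weighted by a class-agnostic bottom-up saliency map over the time-frequency plane, might look like the following sketch; channel and class counts are assumptions.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch of attention pooling over a CNN time-frequency map."""
    def __init__(self, channels=64, num_emotions=4):
        super().__init__()
        self.top_down = nn.Conv2d(channels, num_emotions, kernel_size=1)
        self.bottom_up = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, h):                                 # h: (batch, C, freq, time)
        b, _, f, t = h.shape
        logits_map = self.top_down(h)                     # per-class score per position
        att = self.bottom_up(h).view(b, 1, f * t)         # class-agnostic saliency
        att = torch.softmax(att, dim=-1).view(b, 1, f, t) # normalize over positions
        return (logits_map * att).sum(dim=(2, 3))         # pooled logits: (batch, E)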
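Finally, the residual adapter approach for speaker-to-emotion domain adaptation can be sketched as below: a backbone convolution trained on the speaker corpus is frozen, and only a small 1x1 adapter convolution is trained on the emotion corpus. This shows a series-adapter variant; the exact adapter placement is an assumption, since the abstract does not specify it.

import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    """Sketch: residual adapter wrapped around a frozen speaker-trained conv."""
    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad = False               # keep speaker-trained weights fixed
        self.adapter = nn.Conv2d(conv.out_channels, conv.out_channels,
                                 kernel_size=1, bias=False)
        nn.init.zeros_(self.adapter.weight)       # adapter starts as an identity path

    def forward(self, x):
        h = self.conv(x)
        return h + self.adapter(h)                # residual adaptation of features

Zero-initializing the adapter means the network initially behaves exactly like the speaker-trained backbone, and the emotion corpus only has to learn a small residual correction.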
Keywords/Search Tags: Speech emotion recognition, deep learning, time-frequency information utilization, bilinear pooling, phoneme and speaker auxiliary information