
Deep Learning Models For Speech Emotion Recognition

Posted on: 2021-03-03  Degree: Master  Type: Thesis
Country: China  Candidate: H Y Zhang  Full Text: PDF
GTID: 2428330620975887  Subject: Computer application technology
Abstract/Summary:
Emotional intelligence plays a particularly important role in human activities, and determining the emotion category is at its core. The same semantic content may express different emotions, and different speakers express emotion in different ways, so understanding semantic information alone is not enough for a computer to fully grasp the speaker's intention; to do so, the computer must also possess emotional intelligence. The purpose of speech emotion recognition is to use a computer to extract from speech the features that best represent emotion and to determine the speaker's emotion category from those features, thereby enabling better human-computer interaction.

Research on speech emotion recognition faces three main problems: (1) the lack of unified standards for database construction; (2) the lack of features that best represent speech emotion; (3) the poor generalization and robustness of acoustic models. In view of these problems and the strengths of different neural networks, the contributions of this study are as follows:

(1) A new acoustic model for speech emotion recognition is constructed by combining a recurrent neural network, a convolutional neural network, and a deep residual network: the recurrent network processes temporal information, the convolutional network captures spatial information, and the residual connections alleviate gradient explosion and gradient vanishing.
(2) An attention mechanism and a mask operation are introduced into the neural acoustic model: the attention mechanism focuses on emotionally salient regions, and the mask operation extracts the regions of interest in the speech.
(3) Four new deep learning models are proposed: an attention-based advanced long short-term memory network (AA-LSTM), an attention-based convolutional bi-directional long short-term memory network (CBAM), an attention-based skip-convolutional bi-directional long short-term memory network (SCBAM), and an attention-based skip-convolutional bi-directional long short-term memory network with masking operations (SCBAMM).
(4) The speech is converted into a spectrogram, from which the four proposed models extract 34-dimensional deep learning features; these are combined with 2-dimensional hand-crafted features such as the harmonics-to-noise ratio and pitch, and the combination of spectral features and speech acoustic features is used as the input to the acoustic model.
(5) The performance of the four new deep learning models is verified on the EMO-DB database.

Experiments show that the four proposed models, AA-LSTM, CBAM, SCBAM, and SCBAMM, achieve recognition accuracies of 70.09%, 56.07%, 64.49%, and 72.09%, respectively, on the EMO-DB emotional speech database, so SCBAMM achieves the best classification among them. Compared with classification models from other researchers, SCBAMM also achieves the best performance. This is because SCBAMM not only effectively extracts the time-frequency features that best represent emotion, but also combines the advantages of recurrent, convolutional, and deep residual networks, giving it strong modeling ability.
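Contribution (2) pairs attention pooling with a mask operation: attention weights emotionally salient frames more heavily, and the mask zeroes out frames that are not of interest (e.g. padding). A minimal NumPy sketch of that idea follows; the function `attention_pool`, the random scoring vector `w`, and the toy dimensions (120 frames, 34-dim features) are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w, mask=None):
    """Collapse frame-level features H (T, d) into one utterance-level
    vector. Frames with higher scores (emotionally salient regions)
    contribute more; masked-out frames contribute nothing."""
    scores = H @ w                                 # one relevance score per frame
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # mask operation: drop frames
    alpha = softmax(scores)                        # attention weights, sum to 1
    return alpha @ H                               # weighted sum over frames

rng = np.random.default_rng(0)
H = rng.normal(size=(120, 34))   # e.g. 120 frames of 34-dim deep features
w = rng.normal(size=34)          # learned scoring vector (random stand-in here)
mask = np.arange(120) < 100      # pretend the last 20 frames are padding
utt_vec = attention_pool(H, w, mask)
print(utt_vec.shape)             # (34,)
```

In a trained model, `H` would come from the BiLSTM outputs and `w` would be learned jointly with the rest of the network; the pooling itself is unchanged.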
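Contribution (4) appends hand-crafted descriptors such as pitch to the learned features. One common way to estimate pitch is the autocorrelation method, sketched below on a synthetic voiced frame; the function name `estimate_f0`, the 16 kHz rate, and the 50–400 Hz search band are assumptions for illustration, not taken from the thesis.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency of one voiced frame by finding
    the autocorrelation peak inside the plausible pitch-lag range."""
    frame = frame - frame.mean()                 # remove DC offset
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)      # lag range for [fmin, fmax]
    lag = lo + int(np.argmax(r[lo:hi]))          # strongest periodicity
    return sr / lag

sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 200.0 * t)   # synthetic 200 Hz voiced frame
print(round(estimate_f0(frame, sr), 1))  # 200.0
```

The resulting scalar (together with, e.g., a harmonics-to-noise ratio) can simply be concatenated with the deep features before the classifier, which is the combination scheme the abstract describes.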
Keywords/Search Tags: Speech emotion recognition, feature extraction, attention mechanism, deep neural network