
Research On Feature Fusion Method Of Speech Emotion Recognition Based On Deep Learning

Posted on: 2022-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: W D Zhou
Full Text: PDF
GTID: 2518306605498494
Subject: Control Engineering
Abstract/Summary:
Emotion recognition has broad application prospects in human-computer interaction and related fields. A machine that can understand emotion effectively would greatly improve the experience of human-computer interaction. Emotion has many carriers, and speech is one of the most convenient. How to correctly recognize a speaker's emotion from the speech signal has therefore attracted extensive attention from researchers. This thesis studies speech emotion recognition technology and its application. Based on deep learning models, two speech emotion recognition methods are proposed: a nonlinear feature fusion method using an attention mechanism, and a method based on a multi-channel 2-D convolutional recurrent neural network. The nonlinear feature fusion method uses attention to capture the nonlinear dependence between spatial and temporal features, which linear fusion cannot model. The multi-channel 2-D convolutional recurrent neural network addresses the influence that different linear combinations of emotional features have on recognition results. In addition, an interactive speech emotion recognition system is designed and developed, and the two proposed models are applied to practical conversation analysis to recognize the emotional changes of each speaker in a multi-person conversation scenario. The specific research contents are as follows:

(1) A nonlinear spatio-temporal feature fusion method using an attention mechanism is proposed to solve the problem that linear spatio-temporal feature fusion cannot capture the dynamic dependence between spatial and temporal features at a fine granularity. In this method, a temporal convolution network with attention extracts high-level features in the spatial domain of speech, a long short-term memory network with attention extracts temporal features, and a further attention mechanism performs the nonlinear spatio-temporal fusion. Three attention mechanisms are used in total: the attention inside the temporal convolution network and the long short-term memory network focuses on the emotion-related information in the high-level features each branch extracts, while the attention between the two models captures the dynamic dependence between spatial and temporal features. Experimental results show that this method classifies emotions better than linear fusion.

(2) A multi-channel 2-D convolutional recurrent neural network is proposed. In this method, the original low-level descriptors are segmented by feature type, and the segments are fed into separate convolution channels. A 2-D convolution block in each channel extracts the local information of its feature group, a linear layer maps each channel's output to a common dimension, and the outputs are concatenated. The concatenated output is used as the input of a bi-directional long short-term memory network, so that both the independence of each feature group and the global information in the speech emotional features can contribute. Finally, an attention mechanism emphasizes the emotion-related parts of the speech signal and ignores the silent parts. The effectiveness of the proposed method is verified experimentally.
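The abstract itself contains no code; the following is a minimal PyTorch sketch of the attention-based spatio-temporal fusion idea in (1), not the author's implementation. The module names, feature dimensions, and the dilated 1-D convolutions standing in for the temporal convolution network are all illustrative assumptions.

# Illustrative sketch (not the thesis code): nonlinear spatio-temporal
# feature fusion with attention, assuming log-Mel inputs of shape
# (batch, time, n_mels). All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class AttentiveFusionSER(nn.Module):
    def __init__(self, n_mels=40, d_model=128, n_classes=4):
        super().__init__()
        # "Spatial" branch: dilated 1-D convolutions as a stand-in for the
        # temporal convolution network described above.
        self.tcn = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        # Temporal branch: LSTM over the frame sequence.
        self.lstm = nn.LSTM(n_mels, d_model, batch_first=True)
        # Per-branch self-attention to emphasise emotion-related frames.
        self.attn_tcn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.attn_lstm = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Cross-attention between branches: the nonlinear fusion step.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, x):                                  # x: (batch, time, n_mels)
        s = self.tcn(x.transpose(1, 2)).transpose(1, 2)    # (batch, time, d_model)
        t, _ = self.lstm(x)                                # (batch, time, d_model)
        s, _ = self.attn_tcn(s, s, s)      # self-attention within each branch
        t, _ = self.attn_lstm(t, t, t)
        f, _ = self.cross_attn(t, s, s)    # temporal queries attend to spatial keys
        pooled = torch.cat([f.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

logits = AttentiveFusionSER()(torch.randn(8, 300, 40))     # logits: (8, n_classes)

Method (2) would replace this front end with several per-feature-group 2-D convolution channels whose outputs are projected to a common dimension, concatenated, and fed to a bi-directional LSTM with attention pooling.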
(3) An interactive speech emotion recognition system is designed and developed. The system recognizes the emotions of speakers in multi-person dialogue. The whole system is developed with Qt. The recorded speech signal is enhanced by spectral subtraction to remove environmental noise; the Bayesian information criterion is used to locate time points of significant voice change for speaker separation; voice activity detection removes silent segments; and voiceprint recognition identifies specific speakers. Finally, the two emotion recognition methods proposed in this thesis are put to practical use in the system. The system also supports additional functions such as speech replay and display of the time-domain waveform and spectrogram.
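As an illustration of the enhancement front end mentioned above, here is a minimal NumPy sketch of magnitude spectral subtraction. The frame length, hop size, noise-estimation window and spectral floor are assumed values, not parameters taken from the thesis.

# Minimal spectral-subtraction sketch, assuming the first ~0.3 s of the
# recording is noise only and the signal is longer than one frame.
import numpy as np

def spectral_subtraction(signal, sr=16000, frame_len=512, hop=256, noise_sec=0.3):
    window = np.hanning(frame_len)
    # Split the signal into overlapping, windowed frames.
    n_frames = (len(signal) - frame_len) // hop + 1
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude spectrum from the leading noise-only frames.
    n_noise = max(1, int(noise_sec * sr / hop))
    noise_mag = mag[:n_noise].mean(axis=0)
    # Subtract the noise estimate; floor the result to avoid negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add the enhanced frames back into a waveform.
    out = np.zeros(len(signal))
    for i, frame in enumerate(clean_frames):
        out[i * hop:i * hop + frame_len] += frame * window
    return out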
Keywords/Search Tags:speech emotion recognition, nonlinear spatio-temporal feature fusion, attention mechanism, multi-channel 2-D convolution recurrent neural network, interactive speech emotion recognition system