
Research On Feature Fusion Method Of Speech Emotion Recognition Based On Deep Learning

Posted on: 2022-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: W D Zhou
Full Text: PDF
GTID: 2518306605498494
Subject: Control Engineering
Abstract/Summary:
Emotion recognition has broad application prospects in human-computer interaction and related fields. A machine that can understand emotion effectively would greatly improve the experience of human-computer interaction. Emotion has many carriers, and speech is one of the most convenient. How to correctly recognize a speaker's emotion from the speech signal has therefore attracted extensive attention from researchers. This thesis studies speech emotion recognition technology and its application. Based on deep learning models, two speech emotion recognition methods are proposed: a nonlinear feature fusion method using an attention mechanism, and a method based on a multi-channel 2-D convolutional recurrent neural network. The nonlinear feature fusion method uses attention to capture the nonlinear dependence between spatial and temporal features, which linear fusion cannot model. The multi-channel 2-D convolutional recurrent neural network addresses the influence that different linear combinations of emotional features have on recognition results. In addition, an interactive speech emotion recognition system is designed and developed, and the two proposed models are applied to practical conversation analysis to recognize the emotional changes of each speaker in a multi-person conversation scenario. The specific research contents are as follows:

(1) A nonlinear spatio-temporal feature fusion method using an attention mechanism is proposed to solve the problem that linear spatio-temporal feature fusion cannot capture the dynamic dependence between spatial and temporal features at a fine granularity. In this method, a temporal convolution network with attention extracts high-level features in the spatial domain of speech, a long short-term memory network with attention extracts temporal features, and a further attention mechanism performs the nonlinear spatio-temporal fusion. Three attention mechanisms are used in total: the attention inside the temporal convolution network and the long short-term memory network focuses on the emotion-related information in the high-level features each branch extracts, while the attention between the two models captures the dynamic dependence between spatial and temporal features. Experimental results show that this method classifies emotions better than linear fusion.

(2) A multi-channel 2-D convolutional recurrent neural network is proposed. In this method, the original low-level descriptors are segmented by feature type, and the segments are fed into separate convolution channels. A 2-D convolution block in each channel extracts the local information of its feature group, a linear layer maps each channel's output to a common dimension, and the outputs are concatenated. The concatenated output is used as the input of a bi-directional long short-term memory network, so that both the independence of each feature group and the global information in the speech emotional features can contribute. Finally, an attention mechanism emphasizes the emotion-related parts of the speech signal and ignores the silent parts. The effectiveness of the proposed method is verified experimentally.
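The abstract itself contains no code; the following is a minimal PyTorch sketch of the attention-based spatio-temporal fusion idea in (1), not the author's implementation. The module names, feature dimensions, and the dilated 1-D convolutions standing in for the temporal convolution network are all illustrative assumptions.

# Illustrative sketch (not the thesis code): nonlinear spatio-temporal
# feature fusion with attention, assuming log-Mel inputs of shape
# (batch, time, n_mels). All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class AttentiveFusionSER(nn.Module):
    def __init__(self, n_mels=40, d_model=128, n_classes=4):
        super().__init__()
        # "Spatial" branch: dilated 1-D convolutions as a stand-in for the
        # temporal convolution network described above.
        self.tcn = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        # Temporal branch: LSTM over the frame sequence.
        self.lstm = nn.LSTM(n_mels, d_model, batch_first=True)
        # Per-branch self-attention to emphasise emotion-related frames.
        self.attn_tcn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.attn_lstm = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Cross-attention between branches: the nonlinear fusion step.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, x):                                  # x: (batch, time, n_mels)
        s = self.tcn(x.transpose(1, 2)).transpose(1, 2)    # (batch, time, d_model)
        t, _ = self.lstm(x)                                # (batch, time, d_model)
        s, _ = self.attn_tcn(s, s, s)      # self-attention within each branch
        t, _ = self.attn_lstm(t, t, t)
        f, _ = self.cross_attn(t, s, s)    # temporal queries attend to spatial keys
        pooled = torch.cat([f.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

logits = AttentiveFusionSER()(torch.randn(8, 300, 40))     # logits: (8, n_classes)

Method (2) would replace this front end with several per-feature-group 2-D convolution channels whose outputs are projected to a common dimension, concatenated, and fed to a bi-directional LSTM with attention pooling.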
(3) An interactive speech emotion recognition system is designed and developed. The system recognizes the emotions of speakers in multi-person dialogue. The whole system is developed with Qt. The recorded speech signal is enhanced by spectral subtraction to remove environmental noise; the Bayesian information criterion is used to locate time points of significant voice change for speaker separation; voice activity detection removes silent segments; and voiceprint recognition identifies specific speakers. Finally, the two emotion recognition methods proposed in this thesis are put to practical use in the system. The system also supports additional functions such as speech replay and display of the time-domain waveform and spectrogram.
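As an illustration of the enhancement front end mentioned above, here is a minimal NumPy sketch of magnitude spectral subtraction. The frame length, hop size, noise-estimation window and spectral floor are assumed values, not parameters taken from the thesis.

# Minimal spectral-subtraction sketch, assuming the first ~0.3 s of the
# recording is noise only and the signal is longer than one frame.
import numpy as np

def spectral_subtraction(signal, sr=16000, frame_len=512, hop=256, noise_sec=0.3):
    window = np.hanning(frame_len)
    # Split the signal into overlapping, windowed frames.
    n_frames = (len(signal) - frame_len) // hop + 1
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude spectrum from the leading noise-only frames.
    n_noise = max(1, int(noise_sec * sr / hop))
    noise_mag = mag[:n_noise].mean(axis=0)
    # Subtract the noise estimate; floor the result to avoid negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add the enhanced frames back into a waveform.
    out = np.zeros(len(signal))
    for i, frame in enumerate(clean_frames):
        out[i * hop:i * hop + frame_len] += frame * window
    return out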
Keywords/Search Tags:speech emotion recognition, nonlinear spatio-temporal feature fusion, attention mechanism, multi-channel 2-D convolution recurrent neural network, interactive speech emotion recognition system