
Speech Emotion Recognition Based On Attention Mechanism

Posted on: 2021-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Liang
Full Text: PDF
GTID: 2518306476950259
Subject: Information and Communication Engineering
Abstract/Summary:
Emotion recognition has long been an important area of human-computer interaction, drawing on fields such as artificial intelligence and affective computing. Speech often expresses human emotion directly, so improving the accuracy of speech emotion recognition has been a hot research topic in acoustics. Speech emotion recognition also has practical value in daily life. For example, a machine that automatically recognizes a child's emotions while communicating with or recording the child can help parents attend to the child's mental health. As another example, contactless recognition of the speech emotion of a person being interrogated can help police infer a suspect's psychological state, for instance in deception detection. However, because existing algorithms do not adequately model the temporal structure of speech or the relationship between that structure and emotion categories, they remain at an early stage in such real-life applications. With the support of the National Natural Science Foundation of China, this thesis proposes several attention-based speech emotion recognition algorithms that model the temporal structure of speech and that outperform current state-of-the-art algorithms on multiple public data sets.

The main work and contributions of this thesis are as follows:

(1) Studied the significance of speech emotion recognition and surveyed the most advanced algorithms for it, as well as the application of attention mechanisms in speech and natural language processing and the most advanced attention-based algorithms.

(2) To extract features that better distinguish emotions, the thesis first reviews feature processing for speech emotion recognition and then proposes frame-level speech features suited to mining temporal relationships in speech, replacing traditional static speech features so that as much temporal information as possible is retained for emotion recognition. The thesis shows that certain dimensions of these temporal features clearly separate different speech emotions.

(3) To let Long Short-Term Memory (LSTM) networks process emotional features more efficiently, an LSTM variant called "Attention-LSTM" is proposed. It replaces the forget gate and input gate of the traditional LSTM with a single attention gate that computes self-attention over the cell state. This greatly reduces the number of trainable parameters, significantly improves recognition accuracy while shortening training time, and outperforms the standard LSTM and several advanced variants. Compared with the baseline, recognition accuracy improves relatively by 0.7% on the CASIA public data set, 7.5% on the Enterface public data set, and 5.2% on the GEMEP public data set.

(4) To make the LSTM outputs for different emotion data more discriminative, a method for dynamically screening the LSTM output is proposed. Instead of using only the last state of the LSTM output sequence, the thesis proposes two attention screenings, time-dimensional and feature-dimensional, which automatically weight the emotion-relevant features along both dimensions simultaneously. Combining Attention-LSTM with time-dimensional and feature-dimensional attention yields the best model in this thesis. Compared with the baseline, recognition accuracy improves relatively by 3.1% on CASIA, 18.2% on Enterface, and 17.5% on GEMEP; compared with a Support Vector Machine (SVM) using static features, the model improves performance by 6.2%, 60.6%, and 42.5% on the same three public databases, respectively.

(5) To make full use of the outputs of different LSTM layers for emotion classification, a "Dense LSTM" is proposed that uses the attention mechanism to select features and remove redundant information between layers. Applying the two different-dimensional attention algorithms above to screen features between LSTM layers significantly improves the recognition rate of speech emotions. Compared with the baseline, the relative improvements are 10.3% and 12.8% on the Enterface public data set and 10.9% and 17.4% on the IEMOCAP public data set, respectively.

In summary, modeling the temporal structure of speech and its relationship to emotion categories with the attention mechanism enables the LSTM cell, the LSTM output, and multi-layer LSTMs to extract features that clearly distinguish emotions, thereby effectively improving the recognition rate of speech emotion. This work can further improve speech emotion recognition under real-world conditions and promote the development of human-computer interaction.
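To make contribution (2) concrete, the sketch below shows the general idea of frame-level features: the waveform is split into short overlapping frames and a feature is computed per frame, preserving temporal order, instead of collapsing the whole utterance into static statistics. This is a minimal illustration using log energy as a toy feature; the function name, frame/hop sizes, and choice of feature are assumptions for illustration, not the thesis's actual feature set.

```python
import numpy as np

def frame_level_features(signal, sr, frame_ms=25, hop_ms=10):
    """Toy frame-level feature extractor (illustrative, not the
    thesis's features): split the waveform into overlapping frames
    and compute one log-energy value per frame, so the temporal
    ordering of emotional cues survives into the feature sequence."""
    frame = int(sr * frame_ms / 1000)          # samples per frame
    hop = int(sr * hop_ms / 1000)              # samples between frames
    n = 1 + max(0, (len(signal) - frame) // hop)
    feats = np.empty(n)
    for i in range(n):
        seg = signal[i * hop : i * hop + frame]
        feats[i] = np.log(np.sum(seg ** 2) + 1e-10)  # log energy
    return feats                                # shape (n_frames,)
```

A sequence like this (in practice with many feature dimensions per frame, e.g. spectral features) is what a recurrent model can consume frame by frame, unlike a single static feature vector.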
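Contribution (3) describes replacing the forget and input gates with a single attention gate over the cell state. The step function below is a minimal sketch of one plausible reading of that idea: a softmax "attention gate" over the cell-state dimensions decides how much old state to keep and how much candidate state to write, so one gate does the work of two and the parameter count drops. All parameter names and the exact gate formula are assumptions, not the thesis's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_lstm_step(x_t, h_prev, c_prev, params):
    """One step of a hypothetical 'Attention-LSTM' cell (sketch).

    Instead of separate forget and input gates, a single attention
    gate a_t (softmax over the cell-state dimensions, conditioned on
    the previous cell state itself) interpolates between the old cell
    state and the new candidate content.
    """
    Wa, Ua, ba = params["Wa"], params["Ua"], params["ba"]
    Wc, Uc, bc = params["Wc"], params["Uc"], params["bc"]
    Wo, Uo, bo = params["Wo"], params["Uo"], params["bo"]

    # Attention scores from the input, previous hidden state, and the
    # previous cell state (self-attention on the cell state).
    scores = Wa @ x_t + Ua @ h_prev + ba + c_prev
    a_t = np.exp(scores - scores.max())
    a_t = a_t / a_t.sum()                       # softmax over cell dims

    c_hat = np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # candidate state
    # One gate plays both roles: keep (1 - a_t) of the old state and
    # write a_t of the candidate, removing two gates' worth of weights.
    c_t = (1.0 - a_t) * c_prev + a_t * c_hat
    h_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo) * np.tanh(c_t)
    return h_t, c_t
```

Compared with a standard LSTM, which learns three gates (forget, input, output), this variant learns only the attention gate and the output gate, which is consistent with the abstract's claim of fewer trainable parameters and faster training.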
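Contribution (4) screens the LSTM output sequence along two dimensions rather than keeping only the last state. The sketch below shows one simple way such a two-way screening could look: a softmax over the time axis weights the frames, and a softmax over the feature axis re-weights the feature dimensions; the two results are concatenated into an utterance vector. The weight vectors, scoring functions, and the concatenation are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, w_time, v_feat):
    """Pool an LSTM output sequence H of shape (T, D) with two
    attention screenings (sketch) instead of just taking H[-1]:

    time-dimensional:    one weight per frame, emphasising the
                         frames that carry emotional content;
    feature-dimensional: one weight per feature, emphasising the
                         dimensions that discriminate emotions.
    """
    alpha = softmax(H @ w_time)              # (T,) weights over frames
    pooled = alpha @ H                       # (D,) attention over time
    beta = softmax(H.mean(axis=0) * v_feat)  # (D,) weights over features
    screened = beta * pooled                 # feature-wise re-weighting
    return np.concatenate([pooled, screened])  # (2D,) utterance vector
```

The key contrast with the traditional approach is that `pooled` can draw on every time step, so an emotional burst in the middle of an utterance is not lost the way it can be when only the final LSTM state is used.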
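Contribution (5), the "Dense LSTM", feeds the classifier an attention-screened combination of all LSTM layers' outputs rather than only the top layer's. The sketch below shows one minimal form of such inter-layer fusion: each layer's output sequence is summarised, the summaries are scored against a learned vector, and a softmax over layers mixes the sequences. The scoring scheme and the layer-mean summary are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_layer_fusion(layer_outputs, w):
    """Fuse the outputs of several stacked LSTM layers (each of shape
    (T, D)) with a learned attention over layers (sketch of the
    'Dense LSTM' idea): the classifier sees an attention-weighted mix
    of every layer instead of only the top layer, letting redundant
    layers receive low weight."""
    pooled = np.stack([H.mean(axis=0) for H in layer_outputs])  # (L, D)
    gamma = softmax(pooled @ w)                                  # (L,) layer weights
    stack = np.stack(layer_outputs)                              # (L, T, D)
    return np.tensordot(gamma, stack, axes=1)                    # (T, D) fused sequence
```

In a full model, the time-dimensional and feature-dimensional screenings from contribution (4) could additionally be applied to each layer's output before this fusion, which matches the abstract's description of screening features between LSTM layers.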
Keywords/Search Tags: attention mechanism, speech emotion recognition, attention gate, time-dimensional attention, feature-dimensional attention