
Speech Emotion Recognition Based On Multi-feature Combination And Attention Mechanism

Posted on: 2022-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: J M Zhang
Full Text: PDF
GTID: 2518306545990649
Subject: Control Engineering
Abstract/Summary:
Recognizing a speaker's emotional state is a difficult task in machine learning. The goal is to recognize emotion automatically from human speech, a problem known as Speech Emotion Recognition (SER), which plays a vital role in human-computer interaction: the emotional information carried in speech helps machines understand human intent during natural interaction. Research in this area centers on two kinds of features, traditional hand-crafted low-level features and high-level features learned by neural networks. Earlier work concentrated on extracting hand-crafted emotional features, but with the development of deep learning, neural networks have shown outstanding performance in SER. Feature extraction nevertheless remains a major challenge: how to reliably extract salient features from speech to infer the speaker's emotional state, and how to build a high-performance feature-representation network. To that end, this thesis makes the following contributions:

1) A deep learning emotion recognition model based on the three-dimensional log-Mel spectrogram. Preprocessing of the speech signal is based on key-sequence-segment selection: a Radial Basis Function Network (RBFN) and the K-means clustering algorithm measure intra-cluster similarity, and the segment closest to each cluster centroid is selected to represent the remaining segments (see the sketch after this abstract). Mel-frequency Cepstral Coefficient (MFCC) processing then converts the selected key segments into a three-dimensional log-Mel spectrogram, which serves as the input to a Convolutional Recurrent Neural Network (CRNN). The model uses a multi-scale convolution strategy: two sets of convolution kernels of different scales capture time-domain and frequency-domain information from the input.

2) An attention mechanism and a multi-feature joint network model. The attention mechanism helps handle the silent segments of the speech signal and highlights emotion-related information. To overcome the limitations of traditional hand-crafted low-level features and to capture enough emotional information for the SER task, traditional low-level acoustic features are combined with high-level semantic features learned by deep networks, yielding a dual-channel HSF-CRNN-Attention model. The two kinds of features complement each other to produce a more robust and richer feature representation (sketches of the attention and fusion components also follow below).

The work therefore proceeds along two directions, features and network architecture, to build a multi-feature joint and attention-based network that extracts salient emotional feature representations, evaluated through comparative experiments on the IEMOCAP and EMO-DB datasets. The proposed method achieves an average recall of 82.88% and an accuracy of 82.43% on EMO-DB, and an average recall of 66.1% and an accuracy of 64.18% on IEMOCAP. Compared with the CRNN baseline model, these four figures improve by 5.75%, 5.36%, 5.93%, and 5.45%, respectively, a significant improvement in emotion recognition that demonstrates the effectiveness of the proposed method.
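The key-segment selection in contribution 1) can be made concrete with a short sketch. The Python code below is an illustrative assumption, not the thesis implementation: it clusters fixed-length segments with K-means and keeps, per cluster, the segment nearest the centroid. The RBFN similarity step mentioned in the abstract is omitted, and the segment length and mean-MFCC descriptor are assumed choices.

```python
# Hypothetical sketch of key-segment selection: cluster fixed-length speech
# segments and keep, per cluster, the segment closest to the centroid.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def select_key_segments(wav_path, seg_len=1.0, n_clusters=4, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    hop = int(seg_len * sr)
    # Split the waveform into non-overlapping fixed-length segments.
    segments = [y[i:i + hop] for i in range(0, len(y) - hop + 1, hop)]
    # Describe each segment by its mean MFCC vector (an assumed descriptor).
    feats = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                      for s in segments])
    km = KMeans(n_clusters=min(n_clusters, len(segments)), n_init=10).fit(feats)
    keys = []
    for c, center in enumerate(km.cluster_centers_):
        idx = np.where(km.labels_ == c)[0]
        # The segment nearest the centroid stands in for the whole cluster.
        best = idx[np.argmin(np.linalg.norm(feats[idx] - center, axis=1))]
        keys.append(segments[best])
    return keys
```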
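In SER work, a "three-dimensional" log-Mel input is most commonly built as three channels: the static log-Mel spectrogram plus its first- and second-order deltas. That interpretation is an assumption here; the thesis may define the third dimension differently.

```python
# Minimal sketch: stack static log-Mel features with their deltas to form
# a 3-channel "image" for the CRNN. The channel layout is an assumption.
import numpy as np
import librosa

def log_mel_3d(y, sr=16000, n_mels=64):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                # static channel
    delta1 = librosa.feature.delta(log_mel)           # first-order dynamics
    delta2 = librosa.feature.delta(log_mel, order=2)  # second-order dynamics
    return np.stack([log_mel, delta1, delta2])        # (3, n_mels, frames)
```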
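One plausible reading of the multi-scale convolution strategy is a pair of asymmetric kernel sets, one elongated along the time axis and one along the frequency axis, whose outputs are concatenated. The PyTorch sketch below follows that reading; the kernel shapes and channel counts are illustrative assumptions.

```python
# Hedged sketch of multi-scale convolution: two kernel sets biased toward
# the time and frequency axes respectively, concatenated channel-wise.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        # Kernel wide along the frame (time) axis captures temporal dynamics.
        self.time_conv = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 9), padding=(1, 4))
        # Kernel tall along the mel (frequency) axis captures spectral structure.
        self.freq_conv = nn.Conv2d(in_ch, out_ch, kernel_size=(9, 3), padding=(4, 1))
        self.bn = nn.BatchNorm2d(out_ch * 2)

    def forward(self, x):                  # x: (batch, 3, n_mels, frames)
        t = self.time_conv(x)
        f = self.freq_conv(x)
        return torch.relu(self.bn(torch.cat([t, f], dim=1)))
```

Asymmetric kernels are a common way to bias one kernel set toward temporal dynamics and the other toward spectral structure without the parameter cost of large square kernels.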
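The attention mechanism of contribution 2) is sketched here as simple additive frame-level attention pooling, under the assumption that down-weighting uninformative (e.g., silent) frames is the intended effect; the exact formulation in the thesis may differ.

```python
# Minimal frame-level attention pooling: each frame receives a learned
# weight, so silent or emotion-poor frames can be suppressed.
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                        # h: (batch, frames, dim)
        w = torch.softmax(self.score(h), dim=1)  # per-frame attention weights
        return (w * h).sum(dim=1)                # weighted sum over frames
```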
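Finally, a hedged sketch of how the dual-channel HSF-CRNN-Attention model might fuse the two feature streams, reusing MultiScaleConv and FrameAttention from the sketches above. The 384-dimensional hand-crafted statistics vector (HSF), the layer sizes, and the 4-class output are all assumptions, not values taken from the thesis.

```python
# Illustrative dual-channel fusion: a CRNN+attention branch over the 3D
# log-Mel input, a fully connected branch over hand-crafted HSFs, and a
# classifier over their concatenation. All sizes are assumed.
import torch
import torch.nn as nn

class DualChannelSER(nn.Module):
    def __init__(self, hsf_dim=384, n_classes=4):
        super().__init__()
        self.conv = MultiScaleConv(in_ch=3, out_ch=32)   # yields 64 channels
        self.gru = nn.GRU(input_size=64 * 64, hidden_size=128, batch_first=True)
        self.attn = FrameAttention(128)
        self.hsf_fc = nn.Sequential(nn.Linear(hsf_dim, 128), nn.ReLU())
        self.out = nn.Linear(128 + 128, n_classes)

    def forward(self, spec, hsf):              # spec: (B, 3, 64, T)
        c = self.conv(spec)                     # (B, 64, 64, T)
        c = c.permute(0, 3, 1, 2).flatten(2)    # (B, T, 64*64) frame sequence
        h, _ = self.gru(c)                      # (B, T, 128)
        deep = self.attn(h)                     # (B, 128) attended deep branch
        hand = self.hsf_fc(hsf)                 # (B, 128) hand-crafted branch
        return self.out(torch.cat([deep, hand], dim=1))
```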
Keywords/Search Tags: Speech Emotion Recognition, Deep Learning, Radial Basis Function, Attention Mechanism, Multi-feature joint