Research On Speech Emotion Recognition Technology Based On Attention Mechanism

Posted on: 2024-07-06 | Degree: Master | Type: Thesis
Country: China | Candidate: X H Bai | Full Text: PDF
GTID: 2568307100961879 | Subject: Computer technology
Abstract/Summary:
Human-computer interaction technology is gaining increasing attention thanks to the rapid progress of artificial intelligence. Speech emotion recognition has become a prominent research area and an essential component of human-computer interaction. A speech emotion recognition system typically consists of two parts: feature extraction and the emotion recognition model. Extracting acoustic features rich in emotional information and recognizing those features accurately are the crucial factors that determine the accuracy of speech emotion recognition. To improve robustness, this thesis investigates attention-based speech emotion recognition from two directions: extracting rich emotion-related acoustic features and constructing a robust emotion recognition model. The specific research work includes:

(1) In terms of feature selection, 3-dimensional Log-Mel spectrograms (3D-Log Mels) are used as the input features of the speech emotion recognition model. Traditional Mel-Frequency Cepstral Coefficient (MFCC) features rely on a linear intensity scale, whereas the human auditory system perceives audio signals roughly logarithmically in sound intensity, so MFCCs cannot capture this perception well. The 3D-Log Mels spectrogram used in this thesis is not only more consistent with the human ear's perception of audio cues, but also contains both time-domain and frequency-domain information. It therefore yields rich emotional acoustic features, minimizes the influence of irrelevant factors, and improves speech emotion recognition performance (a feature-extraction sketch follows the abstract).

(2) In terms of feature extraction, a parallel 2D Convolutional Neural Network (CNN) is used to extract time-domain and frequency-domain features simultaneously. Compared with using time-domain or frequency-domain features alone, this time-frequency feature extraction method captures both temporal and spectral information and improves the accuracy of speech emotion recognition (see the parallel-CNN sketch after the abstract).

(3) In terms of the recognition model, an Attention Mechanism (AM) is designed for speech emotion recognition. First, we design an emotion recognition model based on the self-attention mechanism, which assigns weights to the features: parts related to speech emotion receive higher weights and unrelated parts receive lower weights, so that the acoustic features are fully utilized (a self-attention sketch follows the abstract). Although the self-attention-based model can capture long-range dependencies by weighting the features and thereby achieve better recognition rates, self-attention has the drawback of learning query-independent dependencies. This thesis addresses that drawback by introducing the Global Context Attention (GCA) mechanism (sketched after the abstract). In addition, building on the global context attention mechanism, we propose a speech emotion recognition model based on time-frequency features and a dual global context attention mechanism, which overcomes the limitation of self-attention and uses the new dual global context attention to model the correlation between network layers, improving the robustness and recognition rate of the model. The goal of this thesis is to propose a speech emotion recognition model based on time-frequency features and dual global context attention mechanisms that enhances both the ability to capture emotional acoustic features and the accuracy of speech emotion recognition, achieving accuracies of 70.08% and 94.21%, F1 scores of 70.08% and 94.08%, and recall rates of 70.91% and 94.21% on the two evaluation datasets.
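As an illustration of point (1), below is a minimal sketch of how a 3-channel log-Mel input (static log-Mel spectrogram plus its first- and second-order deltas) can be computed with librosa. The sampling rate, 64 Mel bands, and frame/hop sizes are assumptions for illustration, not values taken from the thesis.

```python
# Sketch only: builds a (3, n_mels, frames) "image" from one utterance.
# Parameter values (sr, n_mels, n_fft, hop_length) are illustrative assumptions.
import numpy as np
import librosa

def three_d_log_mels(wav_path, sr=16000, n_mels=64, n_fft=400, hop_length=160):
    y, sr = librosa.load(wav_path, sr=sr)
    # Mel power spectrogram, then log compression to mimic loudness perception.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # (n_mels, frames)
    delta1 = librosa.feature.delta(log_mel, order=1)   # first temporal derivative
    delta2 = librosa.feature.delta(log_mel, order=2)   # second temporal derivative
    # Stack static + delta + delta-delta into three channels.
    return np.stack([log_mel, delta1, delta2], axis=0)
```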
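For point (2), one plausible reading of the parallel 2D CNN is two convolution branches over the (Mel, time) plane, one with kernels elongated along the time axis and one along the frequency axis, whose outputs are concatenated. The kernel shapes, channel widths, and the class name ParallelTFCNN are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class ParallelTFCNN(nn.Module):
    """Two parallel 2D convolution branches: one kernel shape emphasizes the
    time axis, the other the frequency axis (sizes are assumptions)."""
    def __init__(self, in_channels=3, out_channels=32):
        super().__init__()
        self.time_branch = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=(3, 9), padding=(1, 4)),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.freq_branch = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=(9, 3), padding=(4, 1)),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                     # x: (batch, 3, n_mels, frames)
        t = self.time_branch(x)               # time-oriented features
        f = self.freq_branch(x)               # frequency-oriented features
        return torch.cat([t, f], dim=1)       # (batch, 2*out_channels, n_mels, frames)
```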
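For the self-attention model in point (3), the sketch below weights frame-level features with standard multi-head self-attention (queries, keys, and values all come from the same sequence) and pools the weighted frames into an utterance-level vector. The feature dimension and head count are assumptions.

```python
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    """Self-attention over frame-level features: frames carrying more emotional
    information receive larger weights before pooling (dimensions assumed)."""
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads,
                                          batch_first=True)

    def forward(self, x):                        # x: (batch, frames, dim)
        attended, weights = self.attn(x, x, x)   # queries = keys = values = x
        utterance = attended.mean(dim=1)         # pool weighted frames to one vector
        return utterance, weights
```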
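Finally, the Global Context Attention idea can be illustrated with a GCNet-style global context block: a query-independent attention map pools the whole feature map into one context vector, which is passed through a bottleneck transform and added back to every position. This is only a sketch of the underlying mechanism, not the thesis's dual global context attention; the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """GCNet-style global context attention block (sketch; ratio assumed)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.context_mask = nn.Conv2d(channels, 1, kernel_size=1)  # attention logits
        self.softmax = nn.Softmax(dim=-1)
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(           # bottleneck transform of the context
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.size()
        mask = self.context_mask(x).view(b, 1, h * w)         # (B, 1, HW)
        mask = self.softmax(mask).unsqueeze(-1)               # (B, 1, HW, 1)
        feats = x.view(b, c, h * w).unsqueeze(1)              # (B, 1, C, HW)
        context = torch.matmul(feats, mask).view(b, c, 1, 1)  # global context vector
        return x + self.transform(context)                    # broadcast add
```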
Keywords/Search Tags: Speech Emotion Recognition, Time-Frequency Features, Attention Mechanism, Global Context Attention