
Research On Speech Emotion Recognition Based On Multi-Attention Mechanism And Multi-Task Learning

Posted on: 2024-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Xia
Full Text: PDF
GTID: 2568307142952329
Subject: Computer technology
Abstract/Summary:
With the development of information technology and sensor technology, human-computer interaction has entered a new era, and how to communicate naturally with machines has become a hot topic. One of the most valuable research directions is to extract emotional features from the speech signal, predict the speaker's current emotional state in real time, and thereby make machines behave more intelligently and humanely during human-computer interaction, enhancing the interactive experience. In recent years, deep learning has developed significantly. Extending this progress to speech emotion recognition (SER), deep learning-based prediction approaches have become a novel and promising research direction and have been shown to improve recognition accuracy in diverse ways. Based on a comprehensive analysis of the existing literature, this paper addresses the challenges of traditional SER methods. Specifically, using attention mechanisms and multi-task learning, the proposed approach tackles the difficulty of detecting emotion in long utterances, as well as the impact of gender-related differences in speech signals and in the subjective perception of external information. The main contributions of this study are summarized as follows:

(1) A non-personalized feature extraction method is proposed to address current challenges in emotion recognition. Existing methods rely predominantly on personalized features, which makes them susceptible to external variables. The proposed method first extracts differential features (static, first-order differential, and second-order differential) from the speech signal; high-level Log-Mel features are then obtained through time-direction and frequency-direction filters, and a standard CNN layer produces the non-personalized speech emotion features (an illustrative sketch is given after contribution (2) below). Experimental results on the IEMOCAP dataset demonstrate that the proposed feature extraction and learning method has strong expressive capability, with improvements of 3.47% in WA and 2.93% in UA, respectively.

(2) A Cascaded Attention Network (CAN) is proposed to effectively locate and extract emotional information in long utterances. The CAN comprises channel attention, spatial attention, and self-attention. First, channel attention detects the crucial regions within each channel of the non-personalized features. Second, spatial attention analyzes the spatial correlations among the high-level features selected by channel attention, and utterance-level features are obtained with a CNN-BLSTM network. Finally, the utterance-level features are fed into a self-attention layer that learns locally invariant features at different time steps and assigns weighted scores to their temporal dependencies. Experimental results on the IEMOCAP dataset demonstrate that the optimized model improves emotional information extraction in long utterances, with improvements of 3.25% in WA and 4.18% in UA, respectively.
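To make contribution (1) concrete, the following sketch shows one plausible implementation of the three-channel differential Log-Mel front end with time-direction and frequency-direction filters. It is a minimal illustration, not the thesis's actual code: the names (extract_3d_logmel, DirectionalFrontEnd), kernel sizes, and channel counts are assumptions.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def extract_3d_logmel(wav_path, sr=16000, n_mels=40):
    """Stack static, first-order, and second-order differential log-Mel
    features into a 3-channel input for a CNN front end."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         n_fft=400, hop_length=160)
    log_mel = librosa.power_to_db(mel)                # static channel
    delta1 = librosa.feature.delta(log_mel, order=1)  # first-order differential
    delta2 = librosa.feature.delta(log_mel, order=2)  # second-order differential
    return np.stack([log_mel, delta1, delta2])        # (3, n_mels, frames)

class DirectionalFrontEnd(nn.Module):
    """Parallel frequency-direction and time-direction convolutions over the
    3-channel input, followed by a standard CNN layer."""
    def __init__(self, channels=16):
        super().__init__()
        self.freq_conv = nn.Conv2d(3, channels, kernel_size=(9, 1), padding=(4, 0))
        self.time_conv = nn.Conv2d(3, channels, kernel_size=(1, 9), padding=(0, 4))
        self.cnn = nn.Sequential(
            nn.Conv2d(2 * channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):  # x: (batch, 3, n_mels, frames)
        h = torch.relu(torch.cat([self.freq_conv(x), self.time_conv(x)], dim=1))
        return self.cnn(h)
```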
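Similarly, for the cascaded attention of contribution (2), here is a minimal PyTorch sketch assuming an SE-style channel attention and a CBAM-style spatial attention feeding a BLSTM and a self-attention layer; the layer sizes, head count, and pooling choices are illustrative assumptions rather than the thesis's configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: score each feature-map channel."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                # x: (B, C, F, T)
        w = self.fc(x.mean(dim=(2, 3)))  # global average pool -> (B, C)
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over the time-frequency plane."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(s))

class CascadedAttention(nn.Module):
    """Channel attention -> spatial attention -> BLSTM -> self-attention."""
    def __init__(self, channels=32, hidden=128, heads=4):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()
        self.blstm = nn.LSTM(channels, hidden, batch_first=True,
                             bidirectional=True)
        self.self_attn = nn.MultiheadAttention(2 * hidden, heads,
                                               batch_first=True)

    def forward(self, x):                    # x: (B, C, F, T)
        h = self.sa(self.ca(x)).mean(dim=2)  # pool frequency -> (B, C, T)
        h, _ = self.blstm(h.transpose(1, 2)) # (B, T, 2*hidden)
        out, _ = self.self_attn(h, h, h)     # weight temporal dependencies
        return out.mean(dim=1)               # utterance-level embedding
```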
(3) A multi-task learning strategy based on a self-adaptive loss, which determines task weights dynamically, is proposed to account for gender-related differences in speech signals and in the auditory perception of external information. The approach continually adjusts the weights of the different tasks according to gender-based signal variance and variation in the perception of external information (an illustrative sketch of such a loss follows at the end of this abstract). Experimental results on the IEMOCAP dataset demonstrate that the proposed model can exploit multiple tasks to boost the accuracy of speech emotion recognition and mitigate disparities in the auditory perception of external information, with improvements of 8.89% in WA and 7.53% in UA, respectively.

(4) A speech emotion recognition system was designed and implemented, combining and applying the methods described above. The results show that the speech emotion recognition approach based on the multi-attention mechanism and multi-task learning proposed in this paper offers advantages over existing methods and generalizes well to other complex situations.
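As a hedged illustration of the self-adaptive loss in contribution (3), the sketch below uses learnable homoscedastic-uncertainty weighting (Kendall et al., 2018), one common way to let task weights adapt during training; the thesis's exact weighting rule may differ, and the pairing of an emotion task with an auxiliary gender task is an assumption based on the abstract.

```python
import torch
import torch.nn as nn

class AdaptiveMultiTaskLoss(nn.Module):
    """Learnable task weighting via homoscedastic uncertainty
    (Kendall et al., 2018): tasks with higher estimated noise get
    smaller weights, and a log-variance term keeps the weights finite."""
    def __init__(self, n_tasks=2):
        super().__init__()
        # One learnable log-variance per task (task 0: emotion, task 1: gender).
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):
        total = 0.0
        for log_var, loss in zip(self.log_vars, task_losses):
            # exp(-log_var) scales the task loss; 0.5*log_var regularizes.
            total = total + torch.exp(-log_var) * loss + 0.5 * log_var
        return total

# Usage (illustrative): emotion_ce and gender_ce are per-batch task losses.
# criterion = AdaptiveMultiTaskLoss(n_tasks=2)
# loss = criterion([emotion_ce, gender_ce]); loss.backward()
```

Because the log-variances are registered as parameters, the optimizer updates them jointly with the network weights, so the balance between the emotion and gender tasks shifts automatically as training progresses.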
Keywords/Search Tags:Speech Emotion Recognition, Non-Personalized Feature, Cascaded Attention Network, Multi-Task Learning, Self-adaptive Loss