
Research On Speech Emotion Recognition Based On Multi-Attention Mechanism And Multi-Task Learning

Posted on: 2024-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Xia
Full Text: PDF
GTID: 2568307142952329
Subject: Computer technology
Abstract/Summary:
With the development of information technology and sensor technology, human-computer interaction has entered a new era, and how to communicate naturally with machines has become a hot topic. One of the most valuable research directions is to extract emotional features from the speech signal, predict the speaker's current emotional state in real time, and thereby make machines behave more intelligently and humanely during human-computer interaction, enhancing the interactive experience. In recent years, deep learning has developed significantly. Extending this progress to speech emotion recognition (SER), deep learning-based prediction approaches have become a novel and promising research direction and have been shown to improve recognition accuracy in diverse ways. Based on a comprehensive analysis of the existing literature, this paper addresses the challenges of traditional SER methods. Specifically, using attention mechanisms and multi-task learning, the proposed approach tackles the difficulty of detecting emotion in long utterances, as well as the impact of gender-related differences in speech signals and in the subjective perception of external information. The main contributions of this study are summarized as follows:

(1) A non-personalized feature extraction method is proposed to address current challenges in emotion recognition. Existing methods rely predominantly on personalized features, which makes them susceptible to external variables. The proposed method first extracts differential features (static, first-order differential, and second-order differential) from the speech signal; high-level Log-Mel features are then obtained through time-direction and frequency-direction filters, and a standard CNN layer produces the non-personalized speech emotion features (an illustrative sketch is given after contribution (2) below). Experimental results on the IEMOCAP dataset demonstrate that the proposed feature extraction and learning method has strong expressive capability, with improvements of 3.47% in WA and 2.93% in UA, respectively.

(2) A Cascaded Attention Network (CAN) is proposed to effectively locate and extract emotional information in long utterances. The CAN comprises channel attention, spatial attention, and self-attention. First, channel attention detects the crucial regions within each channel of the non-personalized features. Second, spatial attention analyzes the spatial correlations among the high-level features selected by channel attention, and utterance-level features are obtained with a CNN-BLSTM network. Finally, the utterance-level features are fed into a self-attention layer that learns locally invariant features at different time steps and assigns weighted scores to their temporal dependencies. Experimental results on the IEMOCAP dataset demonstrate that the optimized model improves emotional information extraction in long utterances, with improvements of 3.25% in WA and 4.18% in UA, respectively.
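To make contribution (1) concrete, the following sketch shows one plausible implementation of the three-channel differential Log-Mel front end with time-direction and frequency-direction filters. It is a minimal illustration, not the thesis's actual code: the names (extract_3d_logmel, DirectionalFrontEnd), kernel sizes, and channel counts are assumptions.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def extract_3d_logmel(wav_path, sr=16000, n_mels=40):
    """Stack static, first-order, and second-order differential log-Mel
    features into a 3-channel input for a CNN front end."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         n_fft=400, hop_length=160)
    log_mel = librosa.power_to_db(mel)                # static channel
    delta1 = librosa.feature.delta(log_mel, order=1)  # first-order differential
    delta2 = librosa.feature.delta(log_mel, order=2)  # second-order differential
    return np.stack([log_mel, delta1, delta2])        # (3, n_mels, frames)

class DirectionalFrontEnd(nn.Module):
    """Parallel frequency-direction and time-direction convolutions over the
    3-channel input, followed by a standard CNN layer."""
    def __init__(self, channels=16):
        super().__init__()
        self.freq_conv = nn.Conv2d(3, channels, kernel_size=(9, 1), padding=(4, 0))
        self.time_conv = nn.Conv2d(3, channels, kernel_size=(1, 9), padding=(0, 4))
        self.cnn = nn.Sequential(
            nn.Conv2d(2 * channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):  # x: (batch, 3, n_mels, frames)
        h = torch.relu(torch.cat([self.freq_conv(x), self.time_conv(x)], dim=1))
        return self.cnn(h)
```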
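Similarly, for the cascaded attention of contribution (2), here is a minimal PyTorch sketch assuming an SE-style channel attention and a CBAM-style spatial attention feeding a BLSTM and a self-attention layer; the layer sizes, head count, and pooling choices are illustrative assumptions rather than the thesis's configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: score each feature-map channel."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                # x: (B, C, F, T)
        w = self.fc(x.mean(dim=(2, 3)))  # global average pool -> (B, C)
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over the time-frequency plane."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(s))

class CascadedAttention(nn.Module):
    """Channel attention -> spatial attention -> BLSTM -> self-attention."""
    def __init__(self, channels=32, hidden=128, heads=4):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()
        self.blstm = nn.LSTM(channels, hidden, batch_first=True,
                             bidirectional=True)
        self.self_attn = nn.MultiheadAttention(2 * hidden, heads,
                                               batch_first=True)

    def forward(self, x):                    # x: (B, C, F, T)
        h = self.sa(self.ca(x)).mean(dim=2)  # pool frequency -> (B, C, T)
        h, _ = self.blstm(h.transpose(1, 2)) # (B, T, 2*hidden)
        out, _ = self.self_attn(h, h, h)     # weight temporal dependencies
        return out.mean(dim=1)               # utterance-level embedding
```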
(3) A multi-task learning strategy based on a self-adaptive loss, which determines task weights dynamically, is proposed to account for gender-related differences in speech signals and in the auditory perception of external information. The approach continually adjusts the weights of the different tasks according to gender-based signal variance and variation in the perception of external information (an illustrative sketch of such a loss follows at the end of this abstract). Experimental results on the IEMOCAP dataset demonstrate that the proposed model can exploit multiple tasks to boost the accuracy of speech emotion recognition and mitigate disparities in the auditory perception of external information, with improvements of 8.89% in WA and 7.53% in UA, respectively.

(4) A speech emotion recognition system was designed and implemented, combining and applying the methods described above. The results show that the speech emotion recognition approach based on the multi-attention mechanism and multi-task learning proposed in this paper offers advantages over existing methods and generalizes well to other complex situations.
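As a hedged illustration of the self-adaptive loss in contribution (3), the sketch below uses learnable homoscedastic-uncertainty weighting (Kendall et al., 2018), one common way to let task weights adapt during training; the thesis's exact weighting rule may differ, and the pairing of an emotion task with an auxiliary gender task is an assumption based on the abstract.

```python
import torch
import torch.nn as nn

class AdaptiveMultiTaskLoss(nn.Module):
    """Learnable task weighting via homoscedastic uncertainty
    (Kendall et al., 2018): tasks with higher estimated noise get
    smaller weights, and a log-variance term keeps the weights finite."""
    def __init__(self, n_tasks=2):
        super().__init__()
        # One learnable log-variance per task (task 0: emotion, task 1: gender).
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):
        total = 0.0
        for log_var, loss in zip(self.log_vars, task_losses):
            # exp(-log_var) scales the task loss; 0.5*log_var regularizes.
            total = total + torch.exp(-log_var) * loss + 0.5 * log_var
        return total

# Usage (illustrative): emotion_ce and gender_ce are per-batch task losses.
# criterion = AdaptiveMultiTaskLoss(n_tasks=2)
# loss = criterion([emotion_ce, gender_ce]); loss.backward()
```

Because the log-variances are registered as parameters, the optimizer updates them jointly with the network weights, so the balance between the emotion and gender tasks shifts automatically as training progresses.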
Keywords/Search Tags:Speech Emotion Recognition, Non-Personalized Feature, Cascaded Attention Network, Multi-Task Learning, Self-adaptive Loss