
Research On Multi-modal Speech Emotion Recognition Based On Additive Penalty Focal Loss

Posted on: 2022-02-06 | Degree: Master | Type: Thesis
Country: China | Candidate: S Ye | Full Text: PDF
GTID: 2518306569975619 | Subject: Computer Science and Technology
Abstract/Summary:
As an indispensable component of intelligent human-computer interaction systems, Speech Emotion Recognition (SER) has important research significance and broad application prospects in medical assistance, health management, and life services. The rapid development of deep learning technology has injected new vitality into SER, but research at this stage still has many deficiencies. Most work focuses on the design of hand-crafted features or network structures while neglecting the design of loss functions and the synergy between the multiple modalities that contain emotional information. To improve the performance of SER systems, this thesis carries out research on these issues, mainly including the following work:

(1) The influence of convolutional neural network structure, pooling strategy, and multi-head self-attention on model performance is explored (a sketch of attention-based pooling is given below). In view of the differences in emotional expression between genders, a learning method that adopts gender recognition as an auxiliary task is designed to mine and utilize additional latent information, helping the model distinguish between emotion categories.

(2) A novel loss function named Additive Penalty Focal Loss (APFL) is devised to ease the problems of fuzzy decision boundaries and imbalanced sample difficulty. It introduces an angular penalty factor that strictly restricts the decision boundary to enforce higher intra-class compactness and inter-class discrepancy, and a focal factor that adjusts the loss assigned to each sample according to its difficulty, instructing the model to focus on the hard samples that are easily misclassified. Optimizing from these two perspectives guides the model to learn more discriminative affective features during training, improving model performance (a hedged sketch of such a loss follows the abstract). Experiments on three datasets, IEMOCAP, EMODB, and SAVEE, verify its effectiveness, and models trained with APFL show significant performance advantages.

(3) A multi-modal speech emotion recognition method combining image, text, and audio information is proposed to address the problem of emotional information coming from a single source. Unlike typical single-modal methods, it combines spectrogram features extracted by a convolutional neural network, text embeddings extracted by the pretrained language model BERT, and audio features extracted by the pretrained audio model VGGish (a fusion sketch is given below). By comprehensively using the rich emotional information contained in the different modalities, it can better capture the discriminative features between emotion categories. Moreover, the gender-recognition auxiliary task and the APFL loss function are introduced for further performance improvement. Compared with the baseline model, the proposed APFL-based multi-modal speech emotion recognition method improves weighted accuracy and unweighted accuracy by 3% and 5%, respectively, and outperforms the state of the art.
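For point (1), the abstract names pooling strategy and multi-head self-attention but not a concrete design. The sketch below shows one plausible reading, multi-head self-attention applied over frame-level features followed by mean pooling into an utterance-level vector; the dimensions, head count, and mean-pooling readout are assumptions, not the thesis's reported architecture.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Sketch only: self-attention over frames, then pooling over time."""
    def __init__(self, dim=256, heads=4):  # dim/heads are assumed values
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames):                       # frames: (batch, time, dim)
        ctx, _ = self.attn(frames, frames, frames)   # multi-head self-attention
        return ctx.mean(dim=1)                       # pool to an utterance vector
```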
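For point (2), the abstract describes APFL's two ingredients, an additive angular penalty and a focal factor, without reproducing the formula. The following is a minimal sketch under the assumption that the angular penalty behaves like an ArcFace-style additive margin m on the target-class angle and that the focal factor is the standard (1 - p_t)^gamma modulation; the scale s, margin m, and gamma defaults are illustrative, not the thesis's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditivePenaltyFocalLoss(nn.Module):
    """Sketch only: additive angular margin combined with focal modulation."""
    def __init__(self, num_classes, feat_dim, s=30.0, m=0.35, gamma=2.0):
        super().__init__()
        # Learnable class-anchor vectors, compared to features by cosine.
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m, self.gamma = s, m, gamma  # assumed hyperparameters

    def forward(self, features, labels):
        # Cosine similarity between L2-normalized features and class anchors.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        # Additive penalty: widen the target-class angle by m, which tightens
        # the decision boundary (higher intra-class compactness).
        logits = self.s * torch.cos(theta + self.m * one_hot)
        log_p = F.log_softmax(logits, dim=1)
        p_t = (log_p.exp() * one_hot).sum(dim=1)   # probability of true class
        ce = -(log_p * one_hot).sum(dim=1)         # cross-entropy term
        # Focal factor: (1 - p_t)^gamma focuses training on hard samples.
        return ((1.0 - p_t) ** self.gamma * ce).mean()
```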
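For point (3), the fusion mechanism is not specified in the abstract. The sketch below assumes simple late fusion by concatenating the three feature streams, with a shared trunk feeding an emotion head and the auxiliary gender head; the hidden size, dropout, and auxiliary-loss weighting are hypothetical, while 768 and 128 match the standard BERT-base and VGGish embedding sizes.

```python
import torch
import torch.nn as nn

class MultiModalSER(nn.Module):
    """Sketch only: late fusion of spectrogram, BERT, and VGGish features."""
    def __init__(self, spec_dim=512, text_dim=768, audio_dim=128,
                 num_emotions=4, hidden=256):
        super().__init__()
        # Shared trunk over the concatenated modalities (assumed design).
        self.trunk = nn.Sequential(
            nn.Linear(spec_dim + text_dim + audio_dim, hidden),
            nn.ReLU(), nn.Dropout(0.5))
        self.emotion_head = nn.Linear(hidden, num_emotions)  # main task
        self.gender_head = nn.Linear(hidden, 2)              # auxiliary task

    def forward(self, spec_feat, text_emb, audio_emb):
        h = self.trunk(torch.cat([spec_feat, text_emb, audio_emb], dim=1))
        return self.emotion_head(h), self.gender_head(h)

# Training would combine the two heads, e.g.
#   loss = emotion_loss + lam * F.cross_entropy(gender_logits, y_gender)
# where lam is an assumed auxiliary-task weight.
```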
Keywords/Search Tags: Deep Learning, Multi-modal Speech Emotion Recognition, Auxiliary Task Learning, Additive Penalty Focal Loss