Research On Multimodal Emotion Recognition Method Based On Transformer

Posted on: 2024-03-19  Degree: Master  Type: Thesis
Country: China  Candidate: J F Ding  Full Text: PDF
GTID: 2568307136991709  Subject: Electronic information
Abstract/Summary:
With the development of computer science, artificial intelligence is gradually entering people's daily lives, and emotion recognition plays an important role in the field. Early emotion recognition technology was mainly based on a single modality and suffered from problems such as low accuracy and low utilization of sample resources. Multimodal emotion recognition, an emerging interdisciplinary technology in emotion analysis, analyzes the features of each individual modality while also capturing the feature correlations between modalities. The thesis studies emotion recognition on two combinations of modalities (facial expression, speech, and text; and facial expression, speech, and posture) based on the IEMOCAP and Multimodal databases.

First, the sample data in the two databases were processed. For the expression modality, the video samples were split into frames, and facial key point features and facial motion unit features were extracted using the Facet and Dlib tools. For the speech modality, the speech samples were segmented and 74-dimensional speech features, including MFCC, were extracted using the COVAREP tool. For the text modality, the pre-trained GloVe model was used to convert each word into a 300-dimensional word vector. For the posture modality, 150-dimensional skeleton point cloud features, covering 25 human body skeleton points and 6 spatial orientations, were extracted from images collected by Kinect. Using the features extracted from each modality, deep learning networks were designed for emotion recognition. The main research content of the thesis includes:

(1) To address the low utilization of feature information in early fusion, and considering the importance of inter- and intra-modality feature interaction, the thesis proposes an Inter- and Intra-modality Feature Interaction Network (IIFINet) based on the Transformer network and its multi-head attention mechanism. This model achieves complementary and enhanced interactions between auxiliary modality features and target modality features through data-level fusion. IIFINet consists of two modules. The Inter-modality Feature Global Interaction Network (IrFGINet), based on a multi-head cross-attention mechanism, captures potential feature mapping relationships between two modalities and enhances the target modality features through cross-attention. The Intra-modality Feature Global Interaction Network (IaFGINet), based on a multi-head self-attention mechanism, enhances the target features by capturing global feature relevance within a modality. The thesis proposes two emotion recognition networks on this basis: a single-modality emotion recognition network based on the IaFGINet module and a multimodal emotion recognition network based on IIFINet.

(2) To introduce time-dependent relationships into IIFINet, so as to model the continuity of emotional expression and efficiently extract the enhanced feature information, the thesis proposes a Bidirectional Gated Recurrent Unit-Horizontal and Vertical (BiGRU-HV) model based on spatial axis decomposition. This model is composed of bidirectional gated recurrent unit (BiGRU) modules and processes emotional data in parallel along the horizontal and vertical spatial axes to extract modality feature information with a wider receptive field. Based on the IIFINet and BiGRU-HV models, the thesis proposes the Multimodal Emotion Recognition Network (MERNet), which combines global feature interaction with feature extraction.

Based on the three proposed emotion recognition networks, single-modality, bi-modality, and multimodal emotion recognition experiments were conducted on the IEMOCAP and Multimodal databases. The experimental results show that the single-modality emotion recognition network based on the IaFGINet module performs the emotion recognition task efficiently by enhancing intra-modality features; that the multimodal emotion recognition network based on the IIFINet model complements and enhances inter- and intra-modality features and thereby improves the accuracy of multimodal emotion recognition; and that the multimodal emotion recognition network based on the MERNet model, which captures the dependencies between feature elements after enhancement and extracts feature information efficiently, further improves multimodal emotion recognition.
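
The inter- and intra-modality interaction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: the abstract gives no layer sizes, weights, or normalization details, so the function name `multi_head_attention`, the residual additions, the random weights, and the assumption that both modalities have already been projected to a common width `d_model` are all hypothetical. The sketch only shows the general pattern of target-modality queries attending to an auxiliary modality (cross-attention), followed by self-attention within the enhanced target features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query_seq, context_seq, weights, num_heads):
    """Scaled dot-product attention: queries come from query_seq,
    keys and values come from context_seq (hypothetical helper)."""
    w_q, w_k, w_v, w_o = weights
    t_q, d_model = query_seq.shape
    t_c = context_seq.shape[0]
    d_head = d_model // num_heads
    # Project, then split the feature axis into heads: (heads, time, d_head).
    q = (query_seq @ w_q).reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    k = (context_seq @ w_k).reshape(t_c, num_heads, d_head).transpose(1, 0, 2)
    v = (context_seq @ w_v).reshape(t_c, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, t_q, t_c)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                      # (heads, t_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(t_q, d_model)
    return merged @ w_o, attn

def init_weights(rng, d_model):
    # Random projection matrices stand in for learned parameters.
    return tuple(rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(4))

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4                     # illustrative sizes only
target = rng.standard_normal((10, d_model))    # target modality, 10 time steps
auxiliary = rng.standard_normal((7, d_model))  # auxiliary modality, 7 time steps

cross_w = init_weights(rng, d_model)
self_w = init_weights(rng, d_model)

# Inter-modality step: target queries attend to auxiliary keys and values.
cross_out, cross_attn = multi_head_attention(target, auxiliary, cross_w, num_heads)
enhanced = target + cross_out                  # residual connection (assumption)

# Intra-modality step: the enhanced target features attend to themselves.
self_out, self_attn = multi_head_attention(enhanced, enhanced, self_w, num_heads)
output = enhanced + self_out                   # residual connection (assumption)

print(output.shape, cross_attn.shape, self_attn.shape)
```

The output keeps the target modality's sequence length, while each cross-attention map has one row per target time step and one column per auxiliary time step, which is what lets the two modalities differ in length.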
Keywords/Search Tags: Emotion analysis, Multimodal emotion recognition, Transformer network, Feature interaction, Feature extraction