
Multi-level Modality Representation Fusion For Emotion Analysis

Posted on: 2021-07-22    Degree: Master    Type: Thesis
Country: China    Candidate: J Y Zou    Full Text: PDF
GTID: 2518306461970599    Subject: Computer technology
Abstract/Summary:
Emotion recognition is an emerging interdisciplinary research field and one of the key technologies that enables machines to imitate humans. In natural language processing, deep learning and transfer learning techniques have brought great progress in emotion recognition from the text modality. However, with the development of social media platforms such as TikTok, Kwai, and Bilibili, the data available for emotion recognition has gradually shifted from a single text modality to multimodal data combining text, audio, and video. Combining modalities carries more information and compensates for the limitation of relying only on incomplete textual information, which can hinder the decision-making process of emotion recognition. In addition, multimodal emotion recognition has great application potential in healthcare systems (as a tool for psychological analysis), human-computer interaction (accurately grasping user needs), and other areas.

There are currently five key challenges in multimodal emotion recognition: representation, translation, alignment, fusion, and collaborative learning. This thesis focuses on the text and audio modalities and explores three of these challenges in depth: representation, fusion, and collaborative learning. First, effective representation features are extracted from each single modality; on this basis, methods that fully integrate the complementary information between modalities are used to unify the multimodal representations in the same vector space; finally, the practical problem of missing modalities, which requires collaborative learning, is addressed.

This thesis proposes a multi-level, multi-feature audio representation extraction method that combines feature engineering with recurrent neural networks, a fusion strategy based on auxiliary-modality supervised training, and a generative multi-task network, which address the three challenges above respectively. The proposed methods achieve strong emotion recognition performance on public multimodal benchmarks such as IEMOCAP and MELD. The experimental results demonstrate the effectiveness of this work and provide a reference and methodological basis for research on multimodal emotion recognition in artificial intelligence.
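For illustration only, the following is a minimal sketch of the general idea described above, not the thesis's actual architecture: frame-level hand-crafted acoustic features (e.g. MFCCs) are encoded with a recurrent network and fused by concatenation with an utterance-level text vector for emotion classification. All module names, dimensions, and the choice of concatenation fusion are assumptions made for this sketch (PyTorch).

    # Hypothetical sketch of text-audio fusion for emotion recognition;
    # not the thesis's exact method.
    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        """Encodes a sequence of frame-level acoustic features (e.g. MFCCs) with a BiGRU."""
        def __init__(self, feat_dim=40, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)

        def forward(self, frames):                      # frames: (batch, time, feat_dim)
            _, h = self.rnn(frames)                     # h: (2, batch, hidden)
            return torch.cat([h[0], h[1]], dim=-1)      # (batch, 2 * hidden)

    class TextAudioFusion(nn.Module):
        """Concatenation-based fusion of a text vector and an audio vector."""
        def __init__(self, text_dim=768, audio_dim=256, num_classes=6):
            super().__init__()
            self.audio_enc = AudioEncoder()
            self.classifier = nn.Sequential(
                nn.Linear(text_dim + audio_dim, 256),
                nn.ReLU(),
                nn.Linear(256, num_classes),
            )

        def forward(self, text_vec, audio_frames):
            audio_vec = self.audio_enc(audio_frames)            # audio branch
            fused = torch.cat([text_vec, audio_vec], dim=-1)    # simple fusion
            return self.classifier(fused)                       # emotion logits

    # Example: 4 utterances, 100 frames of 40-dim acoustic features, and
    # 768-dim text vectors (e.g. from a pretrained language model).
    model = TextAudioFusion()
    logits = model(torch.randn(4, 768), torch.randn(4, 100, 40))
    print(logits.shape)                                          # torch.Size([4, 6])

In practice, the number of emotion classes would match the target dataset (for example, the label sets of IEMOCAP or MELD), and the fusion operator could be replaced by the auxiliary-modality supervised strategy the thesis describes.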
Keywords/Search Tags:Computer Neural Network, Emotion Recognition, Multi-Modal, Feature Extraction, Multi-Modal Fusion