
Study On Multimodal Emotion Recognition Based On Deep Learning

Posted on: 2021-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: D D Jiang
Full Text: PDF
GTID: 2518306107989749
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, a large number of researchers have carried out sentiment analysis on text, speech, and facial expressions. Human emotions are complex and are expressed in very diverse ways, so jointly considering the characteristics of different modalities is particularly important for accurately judging emotional tendency. Most current research focuses on single-modal or dual-modal emotion recognition, and its accuracy remains limited. To address this problem, this thesis studies multimodal emotion recognition over text, audio, and visual data. The innovations of this thesis are summarized in the following two points:

(1) This thesis proposes IEF-BiGRU, a multi-level contextual multimodal emotion recognition model based on information enhancement (see the first sketch below). The model uses an information enhancement method to amplify the more important modal information during multimodal fusion, and uses a recurrent neural network to extract contextual features both before and after fusion. To a certain extent, this addresses two problems of the traditional concatenation-based multimodal fusion method: the risk of dimension explosion and the failure to account for the differing importance of the modalities. It also addresses a limitation of previous models, which ignore that the contextual information available before multimodal fusion differs from that available after it. Compared with concatenation-based multimodal feature fusion, IEF-BiGRU improves accuracy and F1 score on both the CMU-MOSI and IEMOCAP datasets; when fusing the audio and visual modalities on IEMOCAP, accuracy improves by 15.78% and F1 score by 18.76%.

(2) This thesis proposes IEFATF-BiGRU, a multimodal emotion recognition model based on attention and aggregation mechanisms (see the second sketch below). The model amplifies the contribution of the context that is more relevant to the target utterance, and aggregates information of different levels and granularities so that they complement each other. It improves on IEF-BiGRU, which, when extracting context features, ignores that different parts of the target utterance's context differ in their degree of relevance, and which may lose information as training proceeds from low-level to high-level representations. Experimental results show that, compared with IEF-BiGRU, IEFATF-BiGRU improves accuracy and F1 score on the CMU-MOSI and IEMOCAP datasets; with three-modality fusion on CMU-MOSI, accuracy rises from 81.52% to 83.06% and F1 score from 81.42% to 83.02%.

At the same time, both IEF-BiGRU and IEFATF-BiGRU outperform several existing state-of-the-art models and achieve better emotion classification. In conclusion, the validity of the models proposed in this thesis is verified by experiment and analysis.
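The full thesis is not available on this page, so the following is only a minimal PyTorch-style sketch of the information-enhancement fusion idea described in point (1): each modality is scored and re-weighted before fusion (instead of plain concatenation), and a BiGRU then extracts context over the fused utterance sequence. All class names, layer shapes, and the scoring function are illustrative assumptions, not the author's implementation, and only the post-fusion context pass is shown.

    import torch
    import torch.nn as nn

    class InformationEnhancedFusion(nn.Module):
        """Re-weights each modality before fusion instead of concatenating."""
        def __init__(self, dims, hidden):
            super().__init__()
            # project text/audio/visual features to a common size
            self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
            # one scalar importance score per modality
            self.score = nn.Linear(hidden, 1)

        def forward(self, feats):                    # feats: list of (batch, dim)
            h = [torch.tanh(p(f)) for p, f in zip(self.proj, feats)]
            h = torch.stack(h, dim=1)                # (batch, n_mod, hidden)
            w = torch.softmax(self.score(h), dim=1)  # modality importance weights
            return (w * h).sum(dim=1)                # enhanced fused vector

    class IEFBiGRU(nn.Module):
        def __init__(self, dims, hidden, n_classes):
            super().__init__()
            self.fuse = InformationEnhancedFusion(dims, hidden)
            # BiGRU captures context across utterances after fusion
            self.gru = nn.GRU(hidden, hidden, batch_first=True,
                              bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_classes)

        def forward(self, feats_seq):  # one (batch, seq, dim) tensor per modality
            t = feats_seq[0].shape[1]
            fused = torch.stack(
                [self.fuse([f[:, i] for f in feats_seq]) for i in range(t)],
                dim=1)
            ctx, _ = self.gru(fused)   # contextual features over the dialogue
            return self.out(ctx)       # per-utterance emotion logits

    # Example with made-up feature sizes for text, audio, and visual inputs:
    model = IEFBiGRU(dims=[300, 100, 512], hidden=128, n_classes=6)
    logits = model([torch.randn(8, 20, d) for d in (300, 100, 512)])  # (8, 20, 6)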
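Likewise, a minimal sketch of the attention-and-aggregation idea behind IEFATF-BiGRU in point (2): context utterances more relevant to the target utterance receive larger attention weights, and the low-level target representation is concatenated with the attended summary so earlier information is not lost. The dot-product scoring and concatenation-based aggregation are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class ContextAttention(nn.Module):
        """Relevance-weights context utterances against the target utterance."""
        def __init__(self, dim):
            super().__init__()
            self.query = nn.Linear(dim, dim)

        def forward(self, target, context):  # target: (b, d); context: (b, t, d)
            q = self.query(target).unsqueeze(2)        # (b, d, 1)
            scores = torch.bmm(context, q).squeeze(2)  # (b, t) relevance scores
            alpha = torch.softmax(scores, dim=1)       # attention weights
            summary = torch.bmm(alpha.unsqueeze(1), context).squeeze(1)
            # aggregate low-level (target) and high-level (summary) views
            return torch.cat([target, summary], dim=1)  # (b, 2d)

    att = ContextAttention(dim=256)
    out = att(torch.randn(8, 256), torch.randn(8, 10, 256))  # shape (8, 512)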
Keywords/Search Tags: Multimodal Emotion Recognition, Multimodal Fusion, Information Enhancement, IEF-BiGRU Model, IEFATF-BiGRU Model