Emotion recognition in conversation is an active research topic and a key subtask in building conversational systems with emotion-understanding capability. Recent work has shifted from using only unimodal data to using multimodal data for emotion recognition in conversations. However, the temporal misalignment of emotion features across modalities has not been well addressed. At the same time, most prior work determines the emotion category of each utterance from its own characteristics alone, paying little attention to essential conversational information such as context, emotion-transfer characteristics, and the static characteristics and dynamic emotional states of both speakers and listeners. This paper focuses on multimodal emotion recognition in conversation. Graph convolutional networks (GCNs) are used to fuse multimodal features and capture contextual information while mitigating the temporal misalignment of emotion features, and external knowledge is incorporated to enrich speaker features. On this basis, a two-sequence conditional random field (CRF) is proposed for sequential modeling of emotional inertia in conversation to improve recognition performance. The main work of this paper is as follows:

(1) A Multimodal Temporal Fusion Network (MTFN) based on GCNs is proposed. An LSTM captures temporal contextual information and the dynamic emotional states of speakers, and static disparity features of the conversation participants are then incorporated. Conversation contexts are modeled as graphs, and GCNs fuse the modalities and capture global contextual information, especially cross-temporal and long-range context, to extract emotional utterance features. The network is also optimized for the temporal misalignment of emotion features across modalities, enabling better fusion of the different modalities. The effectiveness of MTFN is verified through comparison and ablation experiments (a minimal sketch of the graph-based fusion idea follows this abstract).

(2) A method for introducing external knowledge to enrich speaker features is proposed. Speaker features are extracted from text data, the relationships between speakers are constructed as undirected graphs, and relational features are extracted with GCNs to enrich the conversation-participant features in MTFN. The method's effectiveness is verified experimentally (see the second sketch below).

(3) An emotion recognition method based on emotional inertia in conversation is proposed. A two-sequence CRF model separately models each conversation participant and the conversation as a whole as sequences, capturing emotion-transfer features. Comparison and ablation experiments show that capturing emotional inertia with the two-sequence CRF effectively improves model performance (see the final sketch below).

The method proposed in this paper outperforms similar models on most metrics of two datasets, IEMOCAP and MELD, and the ablation experiments show that the three proposed methods are effective.
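The following is a minimal PyTorch sketch of the graph-based multimodal fusion idea in (1). It assumes per-utterance text, audio, and visual features have already been extracted and aligned; the module names, dimensions, and window-based graph construction are illustrative assumptions, not the actual MTFN architecture.

```python
# Minimal sketch of GCN-based multimodal fusion over a conversation.
# All names (d_text, window, etc.) are illustrative assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)  # self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        return torch.relu(d_inv_sqrt @ a_hat @ d_inv_sqrt @ self.linear(h))

class MultimodalConversationGCN(nn.Module):
    def __init__(self, d_text, d_audio, d_visual, d_hidden, n_classes):
        super().__init__()
        # BiLSTM captures temporal context and dynamic speaker state.
        self.context_lstm = nn.LSTM(d_text + d_audio + d_visual, d_hidden,
                                    bidirectional=True, batch_first=True)
        self.gcn1 = GCNLayer(2 * d_hidden, d_hidden)
        self.gcn2 = GCNLayer(d_hidden, d_hidden)
        self.classifier = nn.Linear(d_hidden, n_classes)

    def forward(self, text, audio, visual, adj):
        # text/audio/visual: (n_utterances, d_modality); adj: (n, n) graph.
        x = torch.cat([text, audio, visual], dim=-1).unsqueeze(0)
        h, _ = self.context_lstm(x)                        # temporal context
        h = self.gcn1(h.squeeze(0), adj)                   # local fusion
        h = self.gcn2(h, adj)                              # long-range context
        return self.classifier(h)                          # per-utterance logits

def conversation_graph(n, window=4):
    """Connect each utterance to its neighbours within a context window."""
    adj = torch.zeros(n, n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        adj[i, lo:hi] = 1.0
    return adj
```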
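Below is a minimal sketch of the speaker-relation idea in (2). It assumes speakers who appear in the same conversation are linked by an undirected edge; the actual source of the external knowledge and the exact graph construction may differ from this.

```python
# Minimal sketch of enriching speaker features with a relation graph.
# Co-occurrence-based edges are an assumption about the graph construction.
import torch
import torch.nn as nn

def speaker_adjacency(conversations, n_speakers):
    """Undirected edge between speakers appearing in the same conversation."""
    adj = torch.zeros(n_speakers, n_speakers)
    for speakers in conversations:            # each item: list of speaker ids
        for i in speakers:
            for j in speakers:
                if i != j:
                    adj[i, j] = 1.0
    return adj

class SpeakerRelationEncoder(nn.Module):
    """GCN over the speaker graph; outputs relation-aware speaker embeddings
    that can be concatenated with the per-utterance speaker features."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, speaker_feats, adj):
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        return torch.relu(d_inv_sqrt @ a_hat @ d_inv_sqrt @ self.linear(speaker_feats))
```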
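Finally, the two-sequence CRF in (3) can be illustrated by combining a conversation-level linear-chain CRF with one chain per speaker, as sketched below. Summing the two negative log-likelihoods is an assumption about how the sequences are combined, not the paper's confirmed formulation. Intuitively, the per-speaker chains model emotional inertia: a speaker's next utterance tends to keep the previous emotion.

```python
# Minimal sketch of a two-sequence CRF loss: one chain over the whole
# conversation plus one chain per speaker. Loss summation is an assumption.
import torch
import torch.nn as nn

class ChainCRF(nn.Module):
    """Compact linear-chain CRF: negative log-likelihood of one tag sequence."""
    def __init__(self, n_tags):
        super().__init__()
        self.trans = nn.Parameter(torch.zeros(n_tags, n_tags))  # trans[i, j]: i -> j

    def nll(self, emissions, tags):
        # emissions: (T, n_tags); tags: (T,)
        score = emissions[0, tags[0]]
        for t in range(1, tags.size(0)):
            score = score + self.trans[tags[t - 1], tags[t]] + emissions[t, tags[t]]
        # log-partition via the forward algorithm
        alpha = emissions[0]
        for t in range(1, emissions.size(0)):
            alpha = emissions[t] + torch.logsumexp(alpha.unsqueeze(1) + self.trans, dim=0)
        return torch.logsumexp(alpha, dim=0) - score

def two_sequence_crf_loss(emissions, tags, speaker_ids, conv_crf, spk_crf):
    """Conversation-level chain plus a chain over each speaker's own
    utterance subsequence, capturing within-speaker emotional inertia."""
    loss = conv_crf.nll(emissions, tags)
    for s in speaker_ids.unique():
        idx = (speaker_ids == s).nonzero(as_tuple=True)[0]
        if idx.numel() > 1:                   # a chain needs at least two steps
            loss = loss + spk_crf.nll(emissions[idx], tags[idx])
    return loss
```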