| Existing methods for utterance-level emotion recognition in conversations (ERC) usually require building context-sensitive and speaker-sensitive models, because emotion generation theory suggests that human emotion depends not only on the current utterance but also on the context. In addition, the behavior of interlocutors also affects a person’s emotional state. Although these methods have achieved impressive results, machines still struggle to analyze emotions in human conversations. The critical factor is that humans usually rely on common sense to convey emotions. Therefore, many subsequent researchers integrated external knowledge bases into conversational emotion analysis and designed knowledge-sensitive models. In context-sensitive models, previous work employed recurrent neural networks, represented by LSTM and GRU, to model history utterances in temporal order and capture context information. However, these works are still limited by the network’s storage capacity and have difficulty delivering long-term context information. Although researchers have attempted multi-hop memory networks based on the attention mechanism to expand network capacity, these bring the drawback of high computational complexity. In speaker-sensitive models, previous work uses deep network models, represented by memory networks and graph neural networks, to model a speaker's dependence on himself and on others. However, in multi-party conversations, these works cannot handle the problem of coordinating multiple emotion analysis tasks. In knowledge-sensitive models, previous work has enriched the utterance representation with the help of an external knowledge base to understand the commonsense knowledge involved in human language. However, these works still apply common sense mechanically; they lack a mechanism to organically integrate common sense into the dynamic dialogue and interaction process. Based on the
above problems, the research contents of this dissertation are as follows. First, this dissertation studies a context-sensitive ERC method, which adopts a hierarchical self-attention fusion (H-SATF) mechanism to fuse the multimodal representations with different weights and highlight the critical modality data. Then, a Temporal Convolutional Network (TCN) captures the context information. A contextual self-attention (CSAT) network is used to balance the dependencies between the utterance itself and other utterances. In addition, a multi-branch memory (MBM) network is built to expand the storage capacity of the TCN, which can effectively transfer long-term context information with lower computational complexity. The experimental results on the multimodal emotion analysis dataset MOSI show that the proposed context-sensitive ERC method achieves a noticeable performance improvement over other advanced methods. Considering that the behavior of interlocutors also affects another person’s emotional state, and given the defects of existing speaker-sensitive models, this dissertation proposes a Multitask Graph Neural Network (MGNN) to jointly handle dimensional and discrete emotion classification tasks. This method also uses a TCN to capture the context information of the current utterance. Then, a directed graph structure between utterances is constructed, and a graph convolutional network is used to model the context dependence between the speaker and himself. In addition, a loss weight allocation strategy is proposed that adaptively adjusts the loss weight of each task according to the task's "difficulty". The experimental results on datasets of different sizes and domains show that the proposed speaker-sensitive ERC model achieves the best F1 results compared with other advanced methods. Considering that human beings usually rely on common sense to express emotion, and given the shortcomings of existing knowledge-sensitive models, this dissertation proposes a knowledge-sensitive
ERC method called Sentic GAT to solve these challenges. This method uses context- and sentiment-aware graph attention (CSAGAT) to select knowledge according to contextual semantics and word-level sentiment consistency, adjusting the weight of external knowledge. Then, a dialogue transformer is used to capture the context information. Specifically, the transformer uses a hierarchical multi-head self-attention (MHAT) mechanism to interpret the contextual utterances. In addition, to enhance the utterance representation and help the model distinguish context-sensitive from context-free utterances for emotion recognition, this dissertation proposes a supervised contrastive learning strategy that further pushes context-sensitive and context-free utterances apart. The experimental results on different datasets show that the proposed knowledge-sensitive ERC model achieves the best F1 results compared with other advanced methods. |
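
The supervised contrastive learning strategy mentioned above can be illustrated with a minimal sketch. The abstract does not give the loss used in the dissertation, so the code below assumes the widely used general form of supervised contrastive loss: each anchor is pulled toward samples sharing its label and pushed away from all others. The function names, the `temperature` parameter, the use of cosine similarity, and the label convention (1 for context-sensitive, 0 for context-free) are all illustrative assumptions, not the author's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss (an assumed form, see lead-in).

    embeddings: list of utterance vectors (hypothetical encoder outputs).
    labels: assumed convention, 1 = context-sensitive, 0 = context-free.
    Anchors with no same-label partner are skipped.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarity of the anchor to every other sample.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(x) for x in logits.values()))
        # Average negative log-probability of picking a same-label sample.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    # Labels that agree with the geometry versus labels that cut across it.
    print(supervised_contrastive_loss(vectors, [1, 1, 0, 0]))
    print(supervised_contrastive_loss(vectors, [1, 0, 1, 0]))
```

Under these assumptions, the loss is small when utterances with the same label already lie close together and large when same-label utterances are scattered, so minimizing it separates the two groups in the representation space.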