
Data-Driven and Knowledge-Guided Video Affective Computing

Posted on: 2022-07-11    Degree: Doctor    Type: Dissertation
Country: China    Candidate: F Qi    Full Text: PDF
GTID: 1488306560453654    Subject: Computer application technology
Abstract/Summary:
With the development of computer science and the growing demand for personalized human-computer interaction, affective computing is becoming increasingly important, and the next breakthrough in artificial intelligence may well be automatic affective computing. Traditional human-computer interaction, carried out mainly through keyboard, mouse, and screen, pursues only convenience and accuracy and cannot understand or adapt to human emotions or states of mind. Moreover, people's feelings directly influence their decision-making, so emotional capability is crucial to natural interaction between computers and people. Video affective computing is therefore of great significance. However, the multimodal nature of video data, domain differences, the affective gap between low-level visual content and viewers' emotions, and the dynamic development of emotion theory toward fine-grained emotion categories all pose challenges for video affective computing. To address these challenges, this dissertation investigates video affective representation learning and its applications. The main work is as follows:

1) To handle the multimodal nature of video data and the domain discrepancy problem, we define and study multimodal domain adaptation for affective representation learning for the first time. We propose a flexible multimodal domain adaptation framework that applies to video affective representation learning as well as to other multimodal domain adaptation tasks such as cross-modal retrieval. To exploit the complementarity of different modalities, we propose a covariant multimodal attention module that learns the salient parts of each modality's features that are most helpful for fusion, together with a structure-sensitive combination mechanism that captures the global structural information of each modality. We further propose a hybrid-domain joint constraint that applies the adversarial loss to the original modal features, the attention-weighted features, and the fused features in both the source and target domains, so that the learned multimodal video affective representations are discriminative and domain-adaptive.

2) Because video annotation is time-consuming and labor-intensive, large-scale training datasets are scarce, and viewers' physiological signals are often unavailable, we propose a novel knowledge-driven affective representation learning method to bridge the affective gap. For the first time, we bring external emotion knowledge graphs into video tasks. To build a visual emotion knowledge graph, we take the visual objects appearing in the video and emotional concepts as nodes, with edges extracted from the large external knowledge graphs ConceptNet and SenticNet. We feed this emotion knowledge graph to siamese graph convolutional networks to learn an emotion-aware representation. Finally, we evaluate the learned emotion-aware representation on the video highlight detection task on two standard datasets, and extensive experimental results demonstrate the robustness of knowledge-driven emotion representation learning.
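As a hedged illustration of the knowledge-driven idea in 2), the following Python sketch builds a toy visual emotion knowledge graph and runs a single GCN branch over it. The node names, the hard-coded edges, and the random node features are assumptions made purely for illustration; in the dissertation the edges come from ConceptNet and SenticNet, and the graph is fed to siamese (weight-sharing) GCNs rather than one branch.

```python
import torch
import torch.nn as nn

# Toy visual emotion knowledge graph: object nodes detected in a clip plus
# emotion-concept nodes. The edges are hard-coded for illustration only;
# in the dissertation they are extracted from ConceptNet and SenticNet.
nodes = ["dog", "beach", "sunshine", "joy", "calmness"]
edges = [("dog", "joy"), ("sunshine", "joy"), ("beach", "calmness"), ("beach", "sunshine")]

idx = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Symmetric adjacency with self-loops, then D^{-1/2} A D^{-1/2} normalization.
A = torch.eye(n)
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
d_inv_sqrt = A.sum(dim=1).pow(-0.5)
A_hat = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat, h):
        return torch.relu(a_hat @ self.lin(h))

# Random node features stand in for object/concept embeddings (e.g. word vectors).
feat = torch.randn(n, 32)
layer1, layer2 = GCNLayer(32, 64), GCNLayer(64, 64)
h = layer2(A_hat, layer1(A_hat, feat))

# Mean-pool the node states into a clip-level emotion-aware representation.
# A siamese setup would share layer1/layer2 across the two compared inputs.
clip_repr = h.mean(dim=0)
print(clip_repr.shape)  # torch.Size([64])
```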
3) To address the dynamic development of emotion theory toward more fine-grained emotion categories, we investigate representation learning for zero-shot video emotion recognition. For the first time, we propose a flexible framework for zero-shot video emotion recognition, equipped with a novel visual protagonist representation learning method in which a dynamic contextual emotion-aware attention module selects the protagonists. To align better with unseen emotion labels, we learn a multimodal affective embedding space with a noise contrastive estimation objective. We verify the proposed method on three video emotion datasets with different fine-grained label spaces.
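As a hedged illustration of the noise contrastive estimation objective in 3), the sketch below implements an InfoNCE-style alignment loss between video embeddings and emotion-label embeddings. The encoder outputs, batch size, embedding dimension, and temperature are assumptions for illustration; the exact formulation used in the dissertation may differ.

```python
import torch
import torch.nn.functional as F

def nce_alignment_loss(video_emb, label_emb, temperature=0.07):
    """InfoNCE-style loss: each video embedding is pulled toward the embedding
    of its own emotion label and pushed away from the other labels in the batch."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(label_emb, dim=-1)
    logits = v @ t.T / temperature        # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0))     # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy batch: in practice these would come from the protagonist-centred multimodal
# video encoder and from text embeddings of the (possibly unseen) emotion labels.
video_emb = torch.randn(8, 256)
label_emb = torch.randn(8, 256)
print(float(nce_alignment_loss(video_emb, label_emb)))

# At test time, a zero-shot prediction scores each unseen label by the cosine
# similarity between the video embedding and that label's embedding.
```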
Keywords/Search Tags:Video Affective Computing, Multimodal Representation Learning, Domain Adaptation, Zero-shot Learning