
Research On Multimodal Video Sentiment Analysis Based On Fine-grained Annotations

Posted on: 2024-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Huang
Full Text: PDF
GTID: 2568307139470924
Subject: Cyberspace security

Abstract/Summary:
Multimodal Sentiment Analysis (MSA) is a hotspot among multimodal tasks. Unlike traditional sentiment analysis, MSA must integrate multiple modalities, such as text, audio, and image data, when analyzing sentiment. This mirrors how people express emotions in real life: people convey emotion not only through words but also through intonation and gestures. However, existing MSA research still has limitations. Most studies train models using only multimodal annotations, which prevents the uni-modal modules from learning differentiated information. Furthermore, most research focuses on multimodal feature fusion while neglecting better modeling of the individual modalities, particularly the acoustic and visual modalities. This thesis addresses these issues through research on fine-grained multimodal video sentiment analysis. The main contents and contributions of this thesis are as follows:

(1) To address the underutilization of uni-modal fine-grained annotations, we propose a multi-task-learning-based MSA framework that uses fine-grained annotations to improve the differentiation of the uni-modal modules, enabling the model to generalize better. The framework decouples the MSA task into a main task of multimodal sentiment regression and auxiliary tasks of uni-modal sentiment regression, and introduces homoscedastic uncertainty weighting from multi-task learning to balance the convergence of the different tasks and reduce model uncertainty.

(2) To address the under-optimization of uni-modal modules in MSA research, we propose a uni-modal feature extraction and multimodal feature fusion method based on the Transformer encoder. A uni-modal Transformer encoder fully exploits the temporal information within each modality to obtain high-level features, and a multimodal Transformer then learns the interaction and fusion relationships among the high-level features of the individual modalities, yielding powerful multimodal representations and improving the effectiveness of MSA.

(3) To fully utilize fine-grained annotations, we introduce the MixGen method into MSA for the first time and make targeted modifications and optimizations. On this basis, we propose SupMixGen, which interpolates and augments each modality's low-level features together with its sentiment polarity annotations, enriching the dataset and effectively improving the model's robustness.

(4) We conducted extensive experiments on the Chinese video sentiment analysis dataset CH-SIMSv2.0, comparing our proposed methods with other advanced MSA methods and running ablation studies to verify the effectiveness of the components proposed in the framework. The experimental results show that our methods effectively improve sentiment analysis performance, achieve superior results on multiple evaluation metrics, and outperform state-of-the-art models.
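As a reading aid for contribution (1), the following is a minimal PyTorch sketch of homoscedastic uncertainty weighting in the style of Kendall et al. (2018). The abstract names the technique but not the implementation, so the task layout (one multimodal main loss plus three uni-modal auxiliary losses) and all identifiers here are illustrative assumptions, not the thesis's actual code.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic uncertainty weighting for multi-task regression.

    Each task i receives a learnable log-variance; its loss is scaled by
    1 / (2 * sigma_i^2), with a log(sigma_i) term discouraging the model
    from inflating all variances (Kendall et al., 2018).
    """

    def __init__(self, num_tasks: int):
        super().__init__()
        # One learnable log-variance per task, initialized to 0 (sigma = 1).
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        # task_losses: shape (num_tasks,), one scalar loss per task.
        precision = torch.exp(-self.log_vars)                 # 1 / sigma_i^2
        weighted = 0.5 * precision * task_losses + 0.5 * self.log_vars
        return weighted.sum()

# Hypothetical usage: multimodal main task + text/audio/vision auxiliaries.
uw = UncertaintyWeightedLoss(num_tasks=4)
# losses = torch.stack([loss_multimodal, loss_text, loss_audio, loss_vision])
# total = uw(losses); total.backward()
```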
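Contribution (2) describes a two-stage design: per-modality Transformer encoders followed by a multimodal Transformer over the encoded features. Below is a minimal sketch of that pattern, assuming sequence inputs per modality; the dimensions, the concatenation-along-time fusion, and the mean-pooled regression head are assumptions made for illustration, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Transformer encoder over one modality's feature sequence."""

    def __init__(self, feat_dim: int, d_model: int = 128, nhead: int = 4,
                 num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # map raw features to d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                          # x: (batch, seq_len, feat_dim)
        return self.encoder(self.proj(x))          # (batch, seq_len, d_model)

class MultimodalFusion(nn.Module):
    """Second Transformer that attends across the concatenated high-level
    features of all modalities, then regresses a sentiment score."""

    def __init__(self, d_model: int = 128, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)          # sentiment regression head

    def forward(self, h_text, h_audio, h_vision):
        h = torch.cat([h_text, h_audio, h_vision], dim=1)  # join along time axis
        h = self.fusion(h)                                  # cross-modal attention
        return self.head(h.mean(dim=1))                     # pool, predict scalar
```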
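For contribution (3), the abstract states that SupMixGen interpolates low-level uni-modal features together with their sentiment polarity annotations, but it does not specify the interpolation scheme. The sketch below assumes a generic mixup-style convex combination with Beta-sampled coefficients and random pairing; these specifics are assumptions, not the thesis's definition of SupMixGen.

```python
import torch

def supmixgen(feats: torch.Tensor, labels: torch.Tensor, alpha: float = 0.5):
    """Mixup-style augmentation of one modality's low-level features and
    their continuous sentiment polarity labels.

    feats:  (batch, seq_len, dim) low-level features of a single modality
    labels: (batch,) sentiment polarity annotations
    Returns interpolated features and labels for a random pairing.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(feats.size(0))             # random sample pairing
    mixed_feats = lam * feats + (1 - lam) * feats[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_feats, mixed_labels
```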
Keywords/Search Tags: Multimodal, Sentiment Analysis, Feature Fusion, Multi-Task Learning, Self-Attention Mechanism