With the rapid development of internet technology, more and more people are inclined to share their daily lives and express their opinions through videos on social platforms. Over time, a large number of videos carrying personal emotional tendencies have accumulated on these platforms. Analyzing these videos not only helps governments monitor online public opinion, but also helps consumers make purchase decisions. Because of noise and other interfering factors, sentiment analysis based on a single modality has poor robustness and is prone to ambiguity. Moreover, the way humans express emotion is usually the result of multi-modal interaction. Multi-modal video sentiment analysis, which combines data from multiple modalities to infer the emotions of the people in a video, therefore not only alleviates the noise problem of uni-modal analysis but is also closer to how humans actually express emotion. This thesis studies multi-modal video sentiment analysis. The main research contents and contributions are as follows:

(1) Video sentiment analysis based on multi-head attention and multi-task learning. Existing cross-modal video sentiment analysis models suffer from insufficient modal fusion, high spatial complexity, and limited consideration of how the speaker's own attributes affect emotion. To address these problems, a cross-modal video sentiment analysis model combining multi-head attention and multi-task learning is proposed. An improved Transformer encoder is used for modal fusion, capturing local features while also attending to globally salient features, thereby strengthening the correlation and complementarity between modalities. Simplifying the combination form of modal fusion reduces the model's spatial complexity and training cost. In addition, an auxiliary task of speaker gender recognition is added to the main task of emotion classification, and the model is constrained by the loss function of this auxiliary task. Experimental results on two public datasets show that the proposed model is effective: with the exception of a few individual models, it outperforms most existing models while reducing the spatial complexity of the overall model.

(2) Multi-modal video sentiment analysis combining personality and common features. Most existing datasets lack independent uni-modal sentiment annotations, so the difference information of each modality cannot be effectively captured, and few studies consider the influence of uni-modal personality (modality-specific) features on multi-modal common features. To address these problems, a multi-modal video sentiment analysis model combining personality and common features is proposed. The difference in data distribution between common and personality features is exploited to generate uni-modal annotations, allowing the model to perform multi-task learning and improving the generalization of the shared underlying parameters. At the same time, a cross-modal enhancement mechanism is proposed to enhance the multi-modal common features with the uni-modal personality features. Experimental results show that the proposed model is effective: adding a uni-modal emotion recognition task improves the model's ability to capture feature differences, and both generating uni-modal sentiment pseudo-labels from the common features and applying the cross-modal enhancement mechanism help improve the performance of multi-modal sentiment analysis.
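
To make the first contribution concrete, the sketch below shows one plausible way to realize cross-modal fusion with multi-head attention plus an auxiliary speaker-gender task. It is only an illustrative assumption: the module names, the text-queries-audio/visual pairing, the feature dimension, and the loss weight `alpha` are not taken from the thesis, which uses its own improved Transformer encoder and fusion combination.

```python
# Hedged sketch: multi-head cross-modal attention fusion with an auxiliary
# gender-recognition head trained jointly with emotion classification.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse text/audio/visual sequences with multi-head cross-attention."""

    def __init__(self, dim=128, heads=8):
        super().__init__()
        # One possible pairing: text queries the audio and visual streams.
        self.attn_ta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_tv = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, text, audio, visual):
        ta, _ = self.attn_ta(text, audio, audio)    # text attends to audio
        tv, _ = self.attn_tv(text, visual, visual)  # text attends to visual
        fused = self.norm1(text + ta + tv)          # residual fusion
        fused = self.norm2(fused + self.ffn(fused))
        return fused.mean(dim=1)                    # pool over time steps


class MultiTaskSentimentModel(nn.Module):
    """Emotion classification with an auxiliary gender-recognition task."""

    def __init__(self, dim=128, n_emotions=7):
        super().__init__()
        self.fusion = CrossModalFusion(dim)
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.gender_head = nn.Linear(dim, 2)

    def forward(self, text, audio, visual):
        h = self.fusion(text, audio, visual)
        return self.emotion_head(h), self.gender_head(h)


def multitask_loss(emo_logits, gen_logits, emo_y, gen_y, alpha=0.3):
    """Main emotion loss plus a weighted auxiliary gender loss that
    constrains the shared fused representation."""
    ce = nn.CrossEntropyLoss()
    return ce(emo_logits, emo_y) + alpha * ce(gen_logits, gen_y)
```

The key design point this sketch is meant to illustrate is that both heads share the fused representation, so the auxiliary gender loss acts as a regularizing constraint on the fusion module rather than as a separate model.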
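
For the second contribution, the following sketch gives a minimal, assumed interpretation of a cross-modal enhancement mechanism: a gating layer lets each uni-modal personality (modality-specific) feature re-weight and enrich the shared common feature, while separate uni-modal heads support the auxiliary multi-task objective. The gating design, the three-modality loop, and the single-score regression heads are illustrative choices, and the generation of uni-modal pseudo-labels from the distribution difference between common and personality features is deliberately omitted here.

```python
# Hedged sketch: gating-style cross-modal enhancement of common features
# by modality-specific (personality) features, with uni-modal auxiliary heads.
import torch
import torch.nn as nn


class CrossModalEnhancement(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, common, personality):
        # The gate decides, per dimension, how strongly the common feature
        # is enhanced by the modality-specific information.
        g = self.gate(torch.cat([common, personality], dim=-1))
        return common + g * self.proj(personality)


class PersonalityCommonModel(nn.Module):
    """One multi-modal sentiment head plus three uni-modal heads trained on
    generated pseudo-labels (pseudo-label generation not shown)."""

    def __init__(self, dim=128):
        super().__init__()
        self.enhance = nn.ModuleList([CrossModalEnhancement(dim) for _ in range(3)])
        self.multimodal_head = nn.Linear(dim, 1)   # main sentiment score
        self.unimodal_heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(3)])

    def forward(self, common, personalities):
        # personalities: list of text/audio/visual modality-specific features
        enhanced = common
        uni_preds = []
        for enh, head, p in zip(self.enhance, self.unimodal_heads, personalities):
            enhanced = enh(enhanced, p)        # successively enhance common feature
            uni_preds.append(head(p))          # auxiliary uni-modal predictions
        return self.multimodal_head(enhanced), uni_preds
```

In this reading, the uni-modal heads are what turn the generated pseudo-labels into an auxiliary multi-task signal, while the enhancement modules carry the personality information back into the common representation used for the main prediction.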