In recent years, more and more users have begun to express their emotions through multi-modal media such as text-image pairs and short videos. However, traditional sentiment analysis models are usually designed for a single modality such as text or images, so they cannot effectively exploit multi-modal information, and their classification accuracy on multi-modal data is therefore limited. By taking the information of different modalities as input, deep learning can automatically mine the emotional cues of each modality and effectively combine them, thereby improving sentiment analysis. Focusing on innovations in intra-modal emotion feature modeling and cross-modal feature fusion, this thesis studies multi-modal sentiment analysis for text-image pairs and for video. The main research contents are as follows:

(1) To address the problem that existing image sentiment analysis models consider only the relationship between high-level image features and text features while ignoring lower-level image features, a text-image sentiment analysis model based on multi-layer cross-modal attention fusion (MAFSA) is proposed. First, a VGG network with multi-layer convolutions is used to obtain image features at different levels, and BERT word embeddings with a Bi-GRU are used to obtain text emotion features. To make the model focus on the image information that is relevant to the text content, the extracted multi-layer image features are fused with the text features to obtain several groups of single-layer text-image attention fusion features, which are then weighted by an attention network. Finally, the resulting multi-layer text-image attention fusion features are fed into a fully connected layer to obtain the classification result. Experimental results show that, compared with baseline models, MAFSA achieves higher accuracy and F1 score, effectively improving the performance of text-image sentiment classification.

(2) To address the problems that uni-modal feature heterogeneity is hard to preserve during feature extraction in video sentiment analysis and that cross-modal fusion introduces feature redundancy, a video sentiment analysis model based on multi-task learning and a cascade Transformer (MTSA) is proposed. MTSA uses LSTMs within a multi-task learning framework to extract uni-modal contextual semantic information; by accumulating the losses of the auxiliary uni-modal tasks, noise is removed and modal feature heterogeneity is preserved. A multi-task gating mechanism adjusts the cross-modal feature fusion, and the text, audio, and visual features are fused in a cascade Transformer structure to increase the fusion depth and avoid redundant fused features. GradNorm and sub-task weight decay are used to optimize the multi-task losses and balance multi-task training. Experimental results show that MTSA effectively improves the performance of video sentiment analysis.
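
The following is a minimal sketch of the multi-layer cross-modal attention fusion idea described for MAFSA: text features attend to image features drawn from several network levels, the per-level fused features are weighted by an attention network, and the weighted sum is classified. All module names, dimensions, and the exact attention formulation are assumptions for illustration, not the thesis implementation.

```python
# Sketch (assumed, not the thesis code): multi-layer text-image attention fusion.
import torch
import torch.nn as nn


class MultiLayerCrossModalFusion(nn.Module):
    def __init__(self, text_dim=768, image_dims=(128, 256, 512), hidden=256, num_classes=3):
        super().__init__()
        # Project each level of image features and the text features into a shared space.
        self.img_proj = nn.ModuleList(nn.Linear(d, hidden) for d in image_dims)
        self.txt_proj = nn.Linear(text_dim, hidden)
        # Per-level cross-modal attention: text queries attend to one level of image features.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(hidden, num_heads=4, batch_first=True) for _ in image_dims
        )
        # Layer-level attention network that weights the fused features from each level.
        self.layer_attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, text_feat, image_feats):
        # text_feat: (B, T, text_dim) token features, e.g. BERT embeddings passed through a Bi-GRU.
        # image_feats: list of (B, N_i, image_dims[i]) region features from different VGG levels.
        q = self.txt_proj(text_feat)                        # (B, T, hidden)
        fused_levels = []
        for proj, attn, img in zip(self.img_proj, self.cross_attn, image_feats):
            kv = proj(img)                                  # (B, N_i, hidden)
            fused, _ = attn(q, kv, kv)                      # text-guided attention over image regions
            fused_levels.append(fused.mean(dim=1))          # (B, hidden) fused feature per level
        stacked = torch.stack(fused_levels, dim=1)          # (B, L, hidden)
        w = torch.softmax(self.layer_attn(stacked), dim=1)  # weight each level
        pooled = (w * stacked).sum(dim=1)                   # (B, hidden)
        return self.classifier(pooled)                      # sentiment logits
```

Similarly, the sketch below illustrates one possible reading of the cascade Transformer fusion with gated cross-modal features described for MTSA. The cascade order, the gating form, and the dimensions are assumptions; the uni-modal LSTM encoders, auxiliary task losses, and GradNorm balancing are omitted.

```python
# Sketch (assumed, not the thesis code): gated cascade Transformer fusion of text, audio, visual.
import torch
import torch.nn as nn


class CascadeTransformerFusion(nn.Module):
    def __init__(self, dim=128, num_classes=3):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.stage1 = make_layer()  # first stage: fuse text with audio
        self.stage2 = make_layer()  # second stage: fuse the result with visual features
        # Simple gates that scale each auxiliary modality before it enters the cascade.
        self.gate_a = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio, visual):
        # text / audio / visual: (B, T, dim) sequences already encoded per modality (e.g. by LSTMs).
        a = self.gate_a(audio) * audio
        v = self.gate_v(visual) * visual
        ta = self.stage1(torch.cat([text, a], dim=1))   # fuse text and gated audio
        tav = self.stage2(torch.cat([ta, v], dim=1))    # fuse the result with gated visual features
        return self.classifier(tav.mean(dim=1))         # sentiment prediction
```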