With the development of Internet infrastructure and the growing popularity of mobile applications, the user bases of short video platforms such as TikTok and Kuaishou continue to expand. Usage habits have shifted from text-based platforms such as Tieba and other forums to short video platforms, producing a massive amount of multimodal data. Traditional sentiment analysis methods designed for single-modality data are no longer fully applicable to multimodal data, which has made multimodal sentiment analysis a widely studied field. Multimodal sentiment analysis aims to extract, represent, and integrate features from data of multiple modalities, such as text, video, and speech, in order to determine sentiment polarity. In recent years, thanks to the rapid progress of deep learning, multimodal sentiment analysis has developed considerably. This thesis focuses on difficult issues such as heterogeneous cross-modal interaction, noisy information, and feature fusion. The main innovations and research contents are as follows:

(1) To address inefficient fused modal representations when modal features are unaligned, a multimodal sentiment analysis model based on modal cross-mapping and consistency comparison is proposed. The model maps between modalities with a Mapping Attention Coding Layer (MACL), so the features do not need to be aligned. The MACL module discards the Transformer decoder and improves the encoder; through MACL, each modality obtains information from the remaining modalities to compensate for missing modal information. Second, the model uses a Transformer for modal feature extraction, strengthening long-range dependencies within each modality and attending more closely to contextual information. Finally, model robustness is improved by a modal consistency comparison task designed to focus on cross-modal consistency.

(2) To address the noisy information caused by modal heterogeneity, a multimodal sentiment analysis model based on modal contribution recognition and multi-task learning is proposed. The model contains a modal contribution recognition module, a modal fusion module, and a multi-task learning unit. Specifically, the model first designs a language-modality gain detection module that identifies and exploits visual and audio information to reduce modal noise. Second, cross-modal attention is used to enrich the modal information. Finally, unimodal and multimodal tasks are learned jointly to search for the optimal multimodal output. The model achieves better sentiment classification results on two widely used English datasets, CMU-MOSI and CMU-MOSEI, demonstrating that it can effectively exploit differences in modal contributions to reduce modal noise and improve task performance.
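As an illustration of the cross-modal mechanism shared by the first two models, the sketch below shows how one modality can query another through multi-head attention so that unaligned sequences exchange information. This is a minimal PyTorch sketch under stated assumptions: the class name CrossModalMappingBlock, its dimensions, and its layer layout are illustrative and are not taken from the thesis's actual MACL implementation.

```python
# Minimal sketch of a cross-modal attention block in the spirit of MACL:
# the target modality (e.g., text) queries a source modality (e.g., audio)
# so that unaligned sequences can still exchange information.
# Module and argument names here are illustrative, not the thesis's code.
import torch
import torch.nn as nn

class CrossModalMappingBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, len_t, dim), source: (batch, len_s, dim);
        # len_t and len_s may differ, so no explicit alignment is required.
        mapped, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + mapped)        # residual: keep target content
        return self.norm2(x + self.ffn(x))     # position-wise refinement

# Example: text features enriched with audio information.
text = torch.randn(8, 50, 128)    # 50 text tokens
audio = torch.randn(8, 400, 128)  # 400 audio frames, unaligned with text
block = CrossModalMappingBlock(dim=128)
text_enriched = block(text, audio)  # shape (8, 50, 128)
```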
(3) To coordinate the consistency and specificity of the modalities and to enrich the fused feature information, and building on the first two models, a multimodal sentiment analysis model based on dynamic modality screening and heterogeneous discrimination is proposed. The model controls modal fusion through a dynamic filtering network, which judges the similarity between two modalities by projecting their features into a shared subspace and applies a screening algorithm to guarantee fusion quality and strengthen the consistent expression of the modalities. A heterogeneous discrimination network is constructed to exploit unimodal information efficiently and to discover modality-specific information, thereby improving the generalization performance of the model.
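To make the dynamic-filtering idea in the third model more concrete, the following sketch projects two pooled modality features into a shared subspace, measures their cosine similarity there, and uses the score to gate how much of the secondary modality enters the fusion. It is a hypothetical sketch: the module name, the sigmoid-plus-threshold gating rule, and the threshold value are assumptions, not the screening algorithm described in the thesis.

```python
# Illustrative sketch of dynamic filtering: project two modalities into a
# shared subspace, measure their similarity there, and use the score to
# gate how much of the secondary modality enters the fusion.
# Names, the gating rule, and the threshold are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterGate(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, subspace_dim: int = 64, tau: float = 0.3):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, subspace_dim)
        self.proj_b = nn.Linear(dim_b, subspace_dim)
        self.tau = tau  # similarity threshold used for screening

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a: (batch, dim_a), feat_b: (batch, dim_b) -- pooled utterance features.
        za = F.normalize(self.proj_a(feat_a), dim=-1)
        zb = F.normalize(self.proj_b(feat_b), dim=-1)
        sim = (za * zb).sum(dim=-1, keepdim=True)              # cosine similarity in the subspace
        gate = torch.sigmoid(sim) * (sim > self.tau).float()   # screen out low-consistency pairs
        fused = torch.cat([za, gate * zb], dim=-1)             # gated concatenation fusion
        return fused, sim

# Example: gate audio features by their consistency with text.
text_vec = torch.randn(8, 128)
audio_vec = torch.randn(8, 74)
gate_module = DynamicFilterGate(dim_a=128, dim_b=74)
fused, similarity = gate_module(text_vec, audio_vec)  # fused: (8, 128)
```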