According to the 2022 Global Digital Insights Report released by We Are Social and Hootsuite, the number of social media users worldwide has reached 4.62 billion, approximately 58.4% of the world's population and an increase of more than 10% over the same period of the previous year. In this era of booming social media, more and more users share their personal views and experiences with others through images and text posted on their social accounts. Such multimodal data, rich in user sentiment, helps to understand people's views on a given event and has great application value in political elections, market trend analysis, box office prediction, and emotional intervention. Consequently, how to effectively extract sentiment information from these image-text data has been widely studied, and multimodal sentiment analysis has attracted increasing attention from researchers; it is the focus of this paper.

Nevertheless, existing research still has several shortcomings. Because sentiment is abstract and complex, the sentiment of an image is often conveyed by multiple subtle local regions, and existing attention-based multimodal sentiment analysis methods usually cannot locate these sentiment regions accurately and completely with a single attention operation. Secondly, the sentiment information contained in color is not fully exploited. Furthermore, most existing studies use GloVe or BERT as the pre-training model: the word vectors generated by GloVe are static and cannot handle polysemy, while BERT does not fully exploit the lexical structure, syntactic structure, and semantic information in the data, a problem that is especially pronounced in Chinese. In addition, multimodal feature fusion is typically performed with deterministic operations to construct joint multimodal representations, which cannot make effective use of the multimodal information.

To address these problems, the main work of this paper is as follows:

(1) Inspired by the human perceptual process, we simulate this process with a recurrent attention mechanism. It locates the sentiment-discriminative regions more accurately and completely based on past information while ignoring irrelevant information.

(2) Inspired by the art of photography, we focus on the effects of saturation, hue, and lightness on sentiment analysis and verify their effectiveness (a small illustrative sketch of such color features is given at the end of this section).

(3) Most fusion techniques construct joint multimodal representations with deterministic operations, which cannot effectively utilize multimodal information. We therefore employ an auto-fusion network that extracts multimodal features by maximizing the correlation among the modalities, alleviating the information redundancy caused by deterministic operations.

(4) To address the shortcomings of GloVe and BERT, we adopt the ERNIE pre-training model, which greatly enhances the generic semantic representation by modeling lexical structure, syntactic structure, and semantic information in a unified manner.

(5) The MSA-HPP method considers the influence of visual features and color features on sentiment analysis, but the fusion of features from different levels and different sources is also an important factor in sentiment analysis, and its fusion of global and local features is only a simple concatenation. To address these problems, we propose a multimodal sentiment analysis model based on adaptive gated information fusion (AGIF).

In AGIF, the different levels of visual and color features extracted by the Swin Transformer and the CNN are first adaptively fused through a gated information fusion network, weighted by their contributions to sentiment analysis. Then, the different contributions of global and local features to sentiment analysis are learned and fused with gated units (see the sketch below). A series of experiments on four datasets fully demonstrates the superiority and effectiveness of our two proposed algorithms.
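To make the gating idea concrete, the following is a minimal sketch of how an adaptive gated fusion unit of the kind described above could be implemented in PyTorch. The class name GatedFusion, the layer sizes, and the convex-combination form of the gate are illustrative assumptions for exposition, not the exact architecture of AGIF.

```python
# Minimal sketch of an adaptive gated fusion unit (illustrative assumption,
# not the paper's exact AGIF implementation).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptively weights two feature streams (e.g., visual vs. color,
    or global vs. local) before combining them."""
    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)
        self.proj_b = nn.Linear(dim_b, dim_out)
        # The gate sees both streams and outputs per-dimension weights in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(dim_a + dim_b, dim_out),
            nn.Sigmoid(),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_a, feat_b], dim=-1))
        # Convex combination: g controls how much each stream contributes.
        return g * self.proj_a(feat_a) + (1.0 - g) * self.proj_b(feat_b)

# Example (hypothetical dimensions): fuse 768-d visual features with
# 256-d color features into a 512-d joint representation.
fusion = GatedFusion(dim_a=768, dim_b=256, dim_out=512)
visual = torch.randn(8, 768)   # batch of visual features
color = torch.randn(8, 256)    # batch of color features
joint = fusion(visual, color)  # -> shape (8, 512)
```

A learned sigmoid gate of this form lets the network decide, per sample and per dimension, how much each stream should contribute, rather than fixing the weighting with a deterministic concatenation.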
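For the color cues mentioned in contribution (2), a simple hand-crafted descriptor can be built from hue, saturation, and lightness statistics. The sketch below shows one straightforward way to do this with Pillow and NumPy; it uses the HSV "value" channel as a stand-in for lightness, and the function name and choice of statistics are illustrative assumptions, not the feature extractor used in the paper.

```python
# Illustrative sketch (not the paper's extractor): per-image color statistics.
import numpy as np
from PIL import Image

def color_features(path: str) -> np.ndarray:
    """Mean and standard deviation of hue, saturation, and value (brightness)."""
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=np.float32) / 255.0
    pixels = hsv.reshape(-1, 3)  # one row per pixel: (H, S, V)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])  # 6-dim

# Example usage: a 6-dimensional color descriptor for one image.
# feats = color_features("example.jpg")
```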