As an important research direction in human-computer interaction, sentiment analysis has significant applications in many areas of society, such as public opinion analysis, personalized recommendation, and financial forecasting. With the development of communication technology, multimodal data has gradually become the mainstream of social platforms, and multimodal content can convey more information than text alone. People have become accustomed to expressing their opinions through text, images, audio, and video, so traditional text-based unimodal sentiment analysis can no longer meet current needs. Multimodal data has clear advantages for sentiment analysis. First, it carries more information than unimodal data; second, the interactions between modalities support a more accurate judgment of the emotional polarity of multimodal data. However, how to mine the rich information in multimodal data and how to learn the interactions between modalities remain open problems in multimodal sentiment analysis. To address these unresolved issues, this thesis makes the following contributions:

(1) This thesis uses a multi-level attention mechanism to extract information from multimodal data at every level. Text features and image features are extracted by unimodal encoders built on a multi-head self-attention mechanism, which fully attends to the global context of the multimodal data. In the multimodal fusion encoder, a cross-modal attention mechanism fuses the text features and image features, emphasizing the regions of common interest between text and image, so that the model can learn the interaction between the text and image modalities (a sketch of such a fusion block follows below).

(2) Most multimodal sentiment analysis models do not attend to the alignment between modalities during feature extraction, which increases redundant information during modal fusion and hinders the model from learning cross-modal interactions. This thesis therefore introduces a contrastive loss in the unimodal encoder stage: after extraction, the text features and image features are aligned with each other, which improves unimodal feature extraction and lays a solid foundation for multimodal feature fusion. Furthermore, since early fusion and late fusion each have advantages and disadvantages that cannot be ignored, this thesis also proposes a hybrid fusion strategy combining early fusion and late fusion, improving generalization beyond what either strategy achieves alone.
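The thesis does not give implementation details for the multimodal fusion encoder, so the following is only a minimal sketch of a bidirectional cross-modal attention block, assuming PyTorch; the dimensions, residual design, and the symmetric two-direction layout are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of cross-modal attention fusion: text queries attend over
    image keys/values and vice versa, so each modality can emphasize the
    regions of common interest in the other."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, L_t, dim); image_feats: (B, L_v, dim)
        # Text attends to image regions relevant to the words.
        t_attn, _ = self.txt2img(text_feats, image_feats, image_feats)
        # Image attends to text tokens symmetrically.
        v_attn, _ = self.img2txt(image_feats, text_feats, text_feats)
        # Residual connection + layer norm (an assumed design choice).
        return self.norm_t(text_feats + t_attn), self.norm_v(image_feats + v_attn)

# Usage with random tensors standing in for unimodal encoder outputs:
text = torch.randn(4, 32, 256)   # batch of 4, 32 text tokens
image = torch.randn(4, 49, 256)  # 7x7 = 49 image patches
fused_text, fused_image = CrossModalFusion()(text, image)
```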
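Likewise, the exact form of the contrastive loss used to align text and image features is not specified in the abstract. A common choice for this kind of text-image alignment is a symmetric InfoNCE-style loss, sketched here under that assumption; the mean-pooling step and the temperature value are also assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature: float = 0.07):
    """InfoNCE-style symmetric contrastive loss: matched text-image pairs
    in a batch are pulled together, mismatched pairs are pushed apart."""
    # Pool sequence features to one vector per sample, then L2-normalize.
    t = F.normalize(text_emb.mean(dim=1), dim=-1)   # (B, dim)
    v = F.normalize(image_emb.mean(dim=1), dim=-1)  # (B, dim)
    logits = t @ v.t() / temperature                # (B, B) pairwise similarities
    targets = torch.arange(t.size(0), device=t.device)  # diagonal = matched pairs
    # Symmetric cross-entropy over both directions: text->image, image->text.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```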
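Finally, one plausible reading of the hybrid fusion strategy is a feature-level (early) branch combined with a decision-level (late) branch. The sketch below averages the two branches; the actual combination rule, head structure, and number of sentiment classes in the thesis may differ.

```python
import torch
import torch.nn as nn

class HybridFusionClassifier(nn.Module):
    """Sketch of hybrid fusion. Early branch: concatenated features feed a
    joint classifier. Late branch: per-modality classifiers combine at the
    decision level. The final prediction averages both branches."""

    def __init__(self, dim: int = 256, num_classes: int = 3):
        super().__init__()
        self.early_head = nn.Linear(dim * 2, num_classes)  # early (feature-level) fusion
        self.text_head = nn.Linear(dim, num_classes)       # late (decision-level) fusion
        self.image_head = nn.Linear(dim, num_classes)

    def forward(self, text_vec, image_vec):
        # Early fusion: combine features before classification.
        early = self.early_head(torch.cat([text_vec, image_vec], dim=-1))
        # Late fusion: classify each modality, then combine decisions.
        late = (self.text_head(text_vec) + self.image_head(image_vec)) / 2
        # Hybrid: equal weighting of the two branches (an assumption).
        return (early + late) / 2
```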