With the vigorous development of computer science and Internet technology, more and more people express and share their views on online platforms. Properly analyzing and exploiting the sentiments in these viewpoints supports a variety of services and applications and has substantial commercial value. Multi-modal sentiment analysis has therefore gradually become a hot topic in academia and industry. The success of existing multi-modal sentiment classification approaches usually relies on a large amount of annotated data. However, considering the collection and labeling costs of multi-modal data, as well as time and hardware constraints in real applications, few-shot scenarios with only a small amount of labeled data are more common and realistic, yet they have been ignored by existing research. To make up for this deficiency, this thesis studies image-text multi-modal sentiment classification in few-shot scenarios. The research is divided into three phases, whose contents are as follows.

Firstly, facing the poor performance of existing approaches in few-shot scenarios, this thesis proposes a few-shot sentiment classification approach based on multi-modal task prompts, aiming at the proper training of the parameters newly introduced into the model. The approach reduces the number of newly introduced parameters by adjusting the form of modality interaction and classification. Specifically, it first converts the image into features similar to textual embeddings, then fills these visual features, together with the textual embeddings, into a pre-trained language model through a multi-modal template. Given a specific label mapping, the sentiment label is then read off from the output token at the "[MASK]" position of the template, as in the sketch below. Experimental results show that this approach surpasses existing uni-modal and multi-modal approaches on image-text multi-modal sentiment classification in few-shot scenarios.
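Since the thesis publishes no code, the following is a minimal sketch of the prompt-based classification described above, assuming a BERT-style masked language model and an image already encoded as a single feature vector; the template wording, the label words, and names such as VisualPrefix and classify are illustrative, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Label mapping (verbalizer): tie each sentiment label to one word in the
# MLM vocabulary; the words actually used in the thesis are not given.
label_words = {"positive": "good", "neutral": "normal", "negative": "bad"}
label_ids = torch.tensor([tokenizer.convert_tokens_to_ids(w)
                          for w in label_words.values()])

class VisualPrefix(nn.Module):
    """Projects an image feature vector into a few pseudo-token embeddings
    so the image can be filled into the textual template."""
    def __init__(self, img_dim=2048, n_tokens=4, emb_dim=768):
        super().__init__()
        self.n_tokens, self.emb_dim = n_tokens, emb_dim
        self.proj = nn.Linear(img_dim, n_tokens * emb_dim)

    def forward(self, img_feat):                        # (B, img_dim)
        return self.proj(img_feat).view(-1, self.n_tokens, self.emb_dim)

prefix = VisualPrefix()

@torch.no_grad()
def classify(img_feat, text):
    # Multi-modal template: <image pseudo-tokens> <text> "It was [MASK]."
    enc = tokenizer(f"{text} It was {tokenizer.mask_token}.",
                    return_tensors="pt")
    word_emb = mlm.bert.embeddings.word_embeddings(enc.input_ids)
    inputs_embeds = torch.cat([prefix(img_feat), word_emb], dim=1)
    attn_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    logits = mlm(inputs_embeds=inputs_embeds,
                 attention_mask=attn_mask).logits       # (1, seq, vocab)
    mask_pos = prefix.n_tokens + int(
        (enc.input_ids[0] == tokenizer.mask_token_id).nonzero()[0])
    scores = logits[0, mask_pos, label_ids]             # score the label words
    return list(label_words)[int(scores.argmax())]

# e.g. classify(resnet_feature, "what a lovely day at the beach")
```

Note that the linear projection inside VisualPrefix is essentially the only newly introduced parameter block, which is what keeps the model small enough to train properly from a handful of labeled examples.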
Secondly, given that the approach of the first phase cannot fully understand the image information, this thesis proposes a few-shot sentiment classification approach based on image caption pre-training, aiming to bridge the semantic gap between the textual and visual modalities. The approach consists of a pre-training stage and a downstream fine-tuning stage, which together project textual and visual features into the same semantic space. Specifically, it first pre-trains the image encoding module on an image caption task and then loads the pre-trained parameters into the multi-modal prompt-based model to perform multi-modal sentiment classification. Experimental results demonstrate that this approach further improves the performance of multi-modal sentiment classification in few-shot scenarios.

Finally, given the high pre-training cost introduced by the approach of the second phase, this thesis proposes a few-shot sentiment classification approach based on multi-modal rotation prediction pre-training, aiming at the automatic construction of matching image-text pairs for pre-training. The approach constructs a description text from the rotation state of an image and thus generates the pre-training corpus automatically. Specifically, it replaces the image caption task of the second phase with a multi-modal rotation prediction task that only requires collecting unlabeled images, as in the sketch below. Experimental results show that this approach significantly reduces the corpus and time cost of pre-training while maintaining a high level of performance.
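The corpus construction for this phase reduces to pairing each rotated image with a sentence describing its rotation state. A minimal sketch follows; the caption wording and the name build_pretraining_pair are assumptions, since the thesis only states that the text is constructed from the rotation state of the image.

```python
import random
from PIL import Image

# Caption templates for each rotation state; the exact wording used in the
# thesis is not given, so these sentences are illustrative.
ROTATIONS = {
    0: "this image is not rotated",
    90: "this image is rotated by 90 degrees",
    180: "this image is rotated by 180 degrees",
    270: "this image is rotated by 270 degrees",
}

def build_pretraining_pair(path):
    """Turn one unlabeled image into a matching image-text pair by
    rotating it by a random multiple of 90 degrees and pairing it with
    a sentence describing that rotation state."""
    angle = random.choice(list(ROTATIONS))
    image = Image.open(path).rotate(angle, expand=True)
    return image, ROTATIONS[angle]

# e.g. corpus = [build_pretraining_pair(p) for p in unlabeled_image_paths]
```

Because no human labels or captions are needed, the corpus can be regenerated from any pool of unlabeled images, which is where the claimed reduction in corpus and time cost comes from.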