
Research On Recommendation Algorithm Based On Multimodal Information

Posted on: 2024-09-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Y Han
Full Text: PDF
GTID: 1528306944966619
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of mobile Internet technology, data on the Internet has grown explosively, and ever more information services flood the network. Although people can browse all kinds of information online, this abundance also makes it harder to find the information they are looking for. Faced with massive amounts of Internet content, users often cannot locate what they need quickly and accurately, so extracting the information that is valuable to each user from massive data has become a research focus, and personalized recommendation has emerged in response. The task of personalized recommendation is to recommend items or products to each user by analyzing the user's historical behavior and modeling the user's interest preferences. The multimodal information attached to items has an important influence on users when they shop. To make full use of the multimodal features of items and model users' interest preferences accurately, this dissertation studies recommendation algorithms that integrate multimodal information. We first introduce the background of recommendation algorithms and then focus on the problems that arise during multimodal fusion, including learning modality-disentangled item representations, modeling multi-level interactions among modalities, and solving the adaptation problem of multimodal information in sequential recommendation. The main contributions are as follows:

(1) To disentangle the common features of multimodal data while preserving their modality-specific characteristics, we propose a pretraining framework that learns modality-disentangled representations. First, a disentangled encoder is designed to automatically extract modality-common characteristics while preserving modality-specific ones, using the distance between representations in vector space as the supervision signal. Second, separate decoders map the representations back to their corresponding modality spaces, with a supervision signal built via contrastive learning. The model is pretrained to obtain modality-common and modality-specific representations, which are then applied to downstream tasks; experimental results show that recommendation performance improves (see the first sketch below).

(2) To model the multi-level interactions among multimodal features during multimodal fusion, we propose a multimodal interactive network based on self-attention and cross-attention. The model integrates feature interactions at different levels between modalities into a unified framework: item-level cross-modal interaction based on an encoder-decoder, stream-level intra-modal interaction based on self-attention, and stream-level cross-modal interaction based on cross-attention. The model is trained in a multi-task manner. Experimental results show that modeling multi-level interactions between modalities improves recommendation performance (see the second sketch below).

(3) To solve the adaptation problem of multimodal information during multimodal fusion, we propose a multimodal-adaptive hierarchical network. The model contains a two-layer recurrent neural network with an information modulation module placed between the two layers; at each time step, the module selects how much of each modality's information to pass on, conditioned on the previous time steps. The model optimizes its parameters with multi-task training, and experimental results show that it improves recommendation performance (see the third sketch below).
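As a rough illustration of the disentangled encoder in contribution (1), the sketch below splits each modality's features into a common and a specific part and supervises them with a distance-based signal. All layer sizes, names, and the specific MSE/margin losses are assumptions for illustration, not the dissertation's implementation.

```python
# A minimal sketch (not the author's code) of a modality-disentangled encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledEncoder(nn.Module):
    """Splits one modality's features into a modality-common part and a
    modality-specific part (dimensions are illustrative assumptions)."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.common = nn.Linear(dim_in, dim_out)    # projects into a shared space
        self.specific = nn.Linear(dim_in, dim_out)  # keeps modality-specific traits

    def forward(self, x):
        return self.common(x), self.specific(x)

def distance_supervision(c_img, c_txt, s_img, s_txt, margin: float = 1.0):
    """Distance-based signal: common parts of the two modalities are pulled
    together, while the specific parts are pushed apart by a margin."""
    pull = F.mse_loss(c_img, c_txt)  # align modality-common features
    push = F.relu(margin - F.pairwise_distance(s_img, s_txt)).mean()
    return pull + push

# Toy usage with random item features standing in for real modality inputs.
img_enc, txt_enc = DisentangledEncoder(512, 128), DisentangledEncoder(300, 128)
c_i, s_i = img_enc(torch.randn(8, 512))  # batch of 8 image feature vectors
c_t, s_t = txt_enc(torch.randn(8, 300))  # batch of 8 text feature vectors
loss = distance_supervision(c_i, c_t, s_i, s_t)
loss.backward()
```

The next sketch illustrates the stream-level interactions in contribution (2) using standard multi-head attention; the exact layer layout is an assumption, and the dissertation's item-level encoder-decoder branch is omitted for brevity.

```python
# A hedged sketch of stream-level self-attention plus cross-attention
# between two modality streams (image and text), not the exact model.
import torch
import torch.nn as nn

class StreamInteraction(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_seq, txt_seq):
        # intra-modal (stream-level) interaction within the image stream
        img_intra, _ = self.self_attn(img_seq, img_seq, img_seq)
        # cross-modal (stream-level) interaction: image stream queries text
        img_cross, _ = self.cross_attn(img_intra, txt_seq, txt_seq)
        return img_cross

model = StreamInteraction()
img = torch.randn(8, 10, 128)  # batch of 8 sequences of 10 image embeddings
txt = torch.randn(8, 10, 128)  # matching text embeddings
fused = model(img, txt)        # (8, 10, 128) fused representation
```

For contribution (3), the sketch below shows one plausible form of the information modulation idea: a sigmoid gate between two GRU layers, conditioned on the lower layer's hidden state (a summary of previous time steps), scales each modality's input before it reaches the upper layer. The gating design is an assumption, not the dissertation's exact module.

```python
# A rough sketch of a two-layer recurrent network with a per-time-step
# modality gate between the layers (design details are assumptions).
import torch
import torch.nn as nn

class ModulatedTwoLayerRNN(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.lower = nn.GRUCell(2 * dim, dim)  # consumes both modalities
        self.gate = nn.Linear(dim, 2)          # one scalar gate per modality
        self.upper = nn.GRUCell(2 * dim, dim)

    def forward(self, img_seq, txt_seq):
        B, T, dim = img_seq.shape
        h1 = img_seq.new_zeros(B, dim)
        h2 = img_seq.new_zeros(B, dim)
        for t in range(T):
            h1 = self.lower(torch.cat([img_seq[:, t], txt_seq[:, t]], -1), h1)
            # gate conditioned on the lower hidden state, i.e. past time steps
            g = torch.sigmoid(self.gate(h1))  # (B, 2)
            modulated = torch.cat([g[:, :1] * img_seq[:, t],
                                   g[:, 1:] * txt_seq[:, t]], -1)
            h2 = self.upper(modulated, h2)
        return h2

model = ModulatedTwoLayerRNN()
out = model(torch.randn(4, 12, 64), torch.randn(4, 12, 64))  # (4, 64)
```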
(4) To address the problem that fine-grained features of the visual modality cannot be extracted, we propose a feature extraction module based on contrastive learning. Existing multimodal commodity datasets provide no shape or color labels for item images, so we first construct two augmented datasets by converting the colors of items and cropping the central area of each image. We design the supervision signal via contrastive learning to extract the color and shape features of the visual modality, and we use word embeddings to process the textual modality. Finally, we design an edge-disentangling graph neural network for multimodal sequential recommendation. Experimental results show that the model improves recommendation accuracy (see the sketch below).
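The sketch below illustrates the contrastive setup of contribution (4): a color-converted view and a center-cropped view of the same item image serve as augmentations, and matching views form positive pairs scored with a SimCLR-style loss. The transform parameters and the loss function are standard choices assumed here, not the dissertation's exact recipe.

```python
# A hedged sketch of contrastive color/shape feature learning: two augmented
# views per item image, with an NT-Xent-style loss (assumed, SimCLR-style).
import torch
import torch.nn.functional as F
from torchvision import transforms

color_view = transforms.Compose([
    transforms.ColorJitter(hue=0.5, saturation=0.5),  # alter item color
    transforms.ToTensor(),
])
shape_view = transforms.Compose([
    transforms.CenterCrop(128),  # keep the central shape region
    transforms.Resize(224),
    transforms.ToTensor(),
])

def nt_xent(z1, z2, tau: float = 0.2):
    """Matching views of the same item are positives; all other items in the
    batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau         # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for encoder outputs.
loss = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
```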
Keywords/Search Tags:multimodal recommendation, multimodal fusion, feature interaction, pre-training, multi-task learning