Research on Heterogeneous Information Network Representation Learning Algorithm

Posted on: 2023-08-19  Degree: Master  Type: Thesis
Country: China  Candidate: L Zhan  Full Text: PDF
GTID: 2530306800489124  Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the scale of data generated in human society is increasing rapidly. Objects of different types, together with the complex relationships between them, form huge and diverse heterogeneous information networks (HINs). How to mine the knowledge contained in these HINs has become an urgent problem, so it is necessary to study HIN representation learning methods that improve the efficiency of storing and analyzing HINs.

Early methods use network structural proximity as self-supervised information to guide sampling on the HIN, and then learn a dense vector as the representation of each node from the distribution of the samples. However, they ignore the imbalance between the effective information and the redundant information introduced during sampling. They likewise ignore the imbalance between over-represented central nodes and under-sampled terminal nodes. As a result, the samples cannot accurately reflect the characteristics of the original HIN. Because the quality of the learned representations depends directly on the collected samples, these imbalances need to be avoided.

More recently, the HIN representation learning field has also begun to explore models based on Graph Neural Networks (GNNs). However, most of them rely only on the local structural information of the network and do not introduce the global distribution information of the nodes. Because the global distribution of objects in a low-dimensional space helps a deep neural network extract more discriminative features, it is worth exploring how to use such information to guide a heterogeneous GNN in learning node representations.

Given the above problems in HIN representation learning, our main contributions are as follows:

1. Facing the large body of related work on HIN representation learning, we propose a unified learning paradigm, building on existing international surveys and studies. Based on the technical route adopted under the paradigm, existing work is divided into two families: models based on structural similarity and models based on GNNs. Furthermore, we separate the processing of network data from the learning of the representation vectors, and then classify and elaborate the typical methods corresponding to each of these two parts. Based on this review, we summarize the achievements and problems of existing methods and point out directions for future work.

2. To address the problem of imbalanced sampling in traditional proximity-based HIN representation learning methods, we propose CoarSAS2hvec. CoarSAS2hvec first samples the local context information of each node through short-distance heterogeneous random walks, skipping the starting node to avoid the information redundancy caused by self-pairs. On this basis, CoarSAS2hvec uses the node degree distribution to balance the number of walks started from each node, so that context information is fully sampled under different structural distributions. At the end of each round of sampling, CoarSAS2hvec applies a coarsening process that removes a proportion of the over-represented nodes from the node sequences, in order to keep the representation of central nodes and terminal nodes balanced. Furthermore, we introduce a type-indicating matrix into the loss function of traditional network representation learning methods, so that the learned low-dimensional node vectors preserve the heterogeneous relationship information. Experimental results show that the node representations learned by CoarSAS2hvec achieve better results on downstream tasks than the baselines. Further analysis shows that the samples collected by CoarSAS2hvec have higher information entropy and that, even with the traditional loss function, they yield better results than those of competing methods. This demonstrates both the importance of the imbalance issue and the effectiveness of CoarSAS2hvec.

3. Although CoarSAS2hvec solves the imbalance issue in sampling the HIN, it applies only to the traditional three-step pipeline of sampling, representation, and analysis. It cannot solve the problem that, in the currently popular end-to-end learning paradigm, heterogeneous GNNs lack the global distribution information of the nodes as a guide. Therefore, we propose a regularization method based on coding rate compression (CRC-Reg). CRC-Reg takes the output of the model at the end of each training epoch as the global distribution information of the nodes in probability space. By compressing the coding rate of this output, CRC-Reg draws similar nodes closer together in the probability space, thereby improving classification accuracy. CRC-Reg can also be extended to more general classification scenarios, because it does not constrain the original input form of the self-supervised information. Experimental results on node classification in HINs verify that CRC-Reg improves the performance of existing heterogeneous GNN models. In experiments on homogeneous network node classification and image classification, the performance of commonly used GNN and deep neural network models also improves once CRC-Reg is applied, which further indicates its wide applicability.
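
The abstract does not define the coding rate that CRC-Reg compresses. As a minimal sketch, assuming the definition commonly used in the rate-reduction literature, R(Z, eps) = 1/2 * log det(I + d / (n * eps^2) * Z Z^T) for n output vectors of dimension d, the quantity could be computed as below. The function names, the value of eps, and the toy output vectors are illustrative assumptions and are not taken from the thesis.

```python
import math


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s)
                # det(M) is the product of the squared Cholesky diagonal.
                log_det += 2.0 * math.log(low[i][j])
            else:
                low[i][j] = s / low[j][j]
    return log_det


def coding_rate(rows, eps=0.5):
    """Assumed definition: 1/2 * logdet(I + d / (n * eps^2) * Z Z^T).

    `rows` holds n output vectors of dimension d (the columns of Z).
    """
    n, d = len(rows), len(rows[0])
    scale = d / (n * eps * eps)
    # Build I + scale * Z Z^T, a d x d symmetric positive-definite matrix.
    gram = [
        [(1.0 if a == b else 0.0) + scale * sum(r[a] * r[b] for r in rows)
         for b in range(d)]
        for a in range(d)
    ]
    return 0.5 * log_det_spd(gram)


# Hypothetical model outputs: four nodes that agree vs. four that are split.
clustered = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(coding_rate(clustered))  # lower: the outputs occupy one direction
print(coding_rate(spread))     # higher: the outputs span two directions
```

Under this assumed definition, outputs that concentrate in one direction have a lower coding rate than outputs spread over several directions, which matches the abstract's description of compression drawing similar nodes closer together. How CRC-Reg weights this term against the task loss, and whether it is applied per class, is not stated in the abstract.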
Keywords/Search Tags: Heterogeneous information network, Representation learning, Random walk, Network sampling, Coding rate