Research On Semi-supervised Learning Methods On Heterogeneous Information Networks

Posted on: 2016-05-03; Degree: Master; Type: Thesis
Country: China; Candidate: C Luo; Full Text: PDF
GTID: 2308330467998921; Subject: Computer technology
Abstract/Summary:
Information network mining is an important research area of data mining and machine learning. Classification, clustering, and other operations on an information network can reveal a great deal of important information. The results of classification and clustering can also be applied in other areas, such as recommendation systems, information security, and information retrieval.

Most current research on information networks is based on homogeneous information networks, which have only one type of node and one type of edge. In the real world, however, most data is heterogeneous, and information networks are heterogeneous as well. How to carry out data mining tasks effectively on heterogeneous networks is therefore a challenging problem.

Semi-supervised machine learning, unlike supervised and unsupervised machine learning, does not need a large number of training samples and can also handle heterogeneous information. This makes it well suited to mining heterogeneous information networks. Motivated by this property, in this paper we investigate semi-supervised machine learning methods (semi-supervised classification and semi-supervised clustering) on heterogeneous information networks.

The main challenges of this work are as follows:

1. How to deal with heterogeneous information in heterogeneous information networks. Unlike a traditional homogeneous network, a heterogeneous information network has many types of nodes and edges, and different types of nodes or edges carry distinct semantic meanings. How to normalize this information effectively is one of the difficulties of this work.
2. How to deal with large-scale heterogeneous information networks and mine them efficiently. With the advent of the information age, data is becoming not only more diverse but also larger. How to mine heterogeneous information networks both effectively and efficiently is one of the challenges of this paper.
3. The classification criteria on heterogeneous information networks differ greatly from those on homogeneous information networks. In a heterogeneous information network, different types of nodes have different types of classification criteria. How to resolve this heterogeneity of classification criteria is one of the challenges of this paper.

We address these three challenges as follows:

1. For the modeling challenge, we use relation paths or elements to model the heterogeneous relations in heterogeneous information networks.
2. For the large-scale data mining challenge, we use linear machine learning models to handle relationship extraction on heterogeneous information networks. This avoids the time cost that large-scale machine learning models incur during training and testing.
3. For the classification and clustering criteria, we only classify or cluster nodes of the same type, which avoids the differences in criteria described above.

In summary, the main work of this paper is as follows:

1. We present HetPathMine, a semi-supervised classification algorithm on heterogeneous information networks. Unlike traditional semi-supervised classification methods, HetPathMine uses relation paths to represent the relationships in a heterogeneous information network. HetPathMine also has a clear advantage in handling the problem that different node types in a heterogeneous information network may have different classification criteria.
2. We propose a semi-supervised clustering algorithm, PathSelClus. Unlike traditional clustering methods on heterogeneous information networks, it does not require the number of clusters to be specified before clustering. This considerably improves the usability of PathSelClus in real industrial environments.
3. We test the two proposed algorithms on both synthetic data and real-world data (DBLP data and Meetup data). The experimental results demonstrate the effectiveness and efficiency of our methods.
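
The idea of modeling heterogeneous relations with relation paths can be illustrated with a small sketch. It is not HetPathMine or PathSelClus; it only shows, under assumptions, how one relation path (author-paper-author) turns typed edges into a similarity between nodes of the same type. The toy network, the function names, and the PathSim-style normalization are all hypothetical choices made for illustration and do not come from the thesis.

```python
# Toy bibliographic network (hypothetical data, for illustration only).
# Typed edges: each author maps to the set of papers they wrote.
writes = {
    "alice": {"p1", "p2"},
    "bob": {"p2", "p3"},
    "carol": {"p3"},
}


def path_count(edges, x, y):
    """Count author-paper-author path instances between authors x and y.

    With unweighted edges this equals the number of papers they share.
    """
    return len(edges[x] & edges[y])


def pathsim(edges, x, y):
    """PathSim-style similarity along the symmetric path author-paper-author.

    Assumption: this normalization is swapped in for illustration; the
    thesis abstract does not say which path-based measure it uses.
    """
    denom = path_count(edges, x, x) + path_count(edges, y, y)
    return 2 * path_count(edges, x, y) / denom if denom else 0.0


apa_alice_bob = path_count(writes, "alice", "bob")
sim_alice_bob = pathsim(writes, "alice", "bob")
sim_alice_carol = pathsim(writes, "alice", "carol")
```

Because the path starts and ends at the same node type, the resulting scores compare only authors with authors, which matches the abstract's choice to classify or cluster nodes of a single type.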
Keywords/Search Tags: Semi-supervised Classification, Semi-supervised Clustering, Heterogeneous Information Networks