Outlier Detection In Heterogeneous Information Networks

Posted on: 2018-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: M X Yao
GTID: 2428330545461184
Subject: Computer Science and Technology
Abstract/Summary:
Finding latent, valuable knowledge in massive data is a major challenge. One important task is outlier detection: identifying data that differ significantly from the rest. Most existing outlier detection algorithms target high-dimensional data, uncertain data, data streams, and time series data. Only recently has a growing number of studies addressed outlier detection in information networks. Information networks, especially heterogeneous information networks, have complex structure and rich information owing to the diversity of their vertices and edges, which makes outlier detection more challenging.

In this thesis, we define the concept of outliers with abnormal correlation in heterogeneous information networks, where abnormal correlation manifests as abnormality in the attribute characteristics and connection characteristics of the associated vertices. We then extend an existing query language framework so that it applies to the outlier detection studied in this thesis. Based on this correlation, we propose a new outlier detection algorithm called CBOut. In this algorithm, users are free to specify the type of outliers and the criteria used to measure whether vertices are outliers. CBOut computes the similarity matrix of the vertices in the network using a new similarity measure, and then obtains clusters with the affinity propagation clustering method. Finally, all vertices in small-scale clusters are reported as outliers. Experimental results show that our method effectively detects the outliers defined in this thesis on both synthetic and real datasets.

Different similarity measures are proposed for computing the vertex similarity matrix under a single measure criterion and under multiple measure criteria. For a single measure criterion, this thesis proposes a new method to optimize the similarity calculation across multiple queries. This method applies a least-frequently-used replacement strategy based on path length to selectively store the eigenvectors of associated vertices. It reduces the time spent on similarity calculation across multiple queries when the number of stored eigenvectors is limited. Experiments on a real dataset show that the optimization performs well. For multiple measure criteria, the similarity measure used in CBOut needs to assign different preference weights to the different criteria. Users can specify these preference weights in the query language based on domain knowledge. When users cannot give the preference weights explicitly, this thesis also proposes an adaptive weight adjustment mechanism that derives preference weights consistent with the characteristics of the network. If a weight setting leads to higher clustering quality, it also leads to more precise outlier detection. Experiments on synthetic and real datasets verify that the adaptive weight adjustment mechanism improves clustering quality after the weights are adjusted, and in turn improves the precision of outlier detection.
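As a rough illustration of two steps described above, the following Python sketch combines per-criterion similarity matrices using preference weights and then flags every vertex that falls in a small cluster. It is not the CBOut implementation. The function names, the normalized weighted sum, and the `max_size` threshold are all assumptions made for this example, and the cluster labels are taken as given (in the thesis they would come from affinity propagation clustering).

```python
from collections import Counter


def combine_similarities(matrices, weights):
    """Merge per-criterion similarity matrices into one matrix.

    Assumption: a weighted sum normalized by the total weight. The abstract
    only says that each measure criterion receives a preference weight.
    """
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need exactly one weight per similarity matrix")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(matrices[0])
    return [
        [
            sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
            for j in range(n)
        ]
        for i in range(n)
    ]


def small_cluster_outliers(labels, max_size=1):
    """Return the vertices whose cluster has at most `max_size` members.

    `labels` maps a vertex id to its cluster id. `max_size` is a hypothetical
    threshold for what counts as a "small-scale" cluster.
    """
    sizes = Counter(labels.values())
    return sorted(v for v, c in labels.items() if sizes[c] <= max_size)


if __name__ == "__main__":
    # Two made-up criteria over two vertices, weighted 3:1.
    by_attribute = [[1.0, 0.0], [0.0, 1.0]]
    by_connection = [[1.0, 1.0], [1.0, 1.0]]
    print(combine_similarities([by_attribute, by_connection], [3, 1]))

    # Made-up cluster labels for six vertices.
    labels = {"a1": 0, "a2": 0, "a3": 0, "a4": 1, "a5": 2, "a6": 2}
    print(small_cluster_outliers(labels, max_size=1))
```

Raising `max_size` widens what counts as a small cluster, so more vertices are reported. How the thesis chooses this boundary is not stated in the abstract.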
Keywords/Search Tags: heterogeneous information networks, abnormal correlation, outlier detection