Font Size: a A A

Research On Self-Similarity Join In Heterogeneous Networks

Posted on:2014-04-23Degree:MasterType:Thesis
Country:ChinaCandidate:J W ZhouFull Text:PDF
GTID:2298330434466149Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Similarity join is a fundamental operation required in many research fields, and numbers of practical problems can reduce to it. It is attracting attention from various applications on networks and linked data, such as link prediction, community detection and online advertising. However, few works have studied the similarity join problem in heterogeneous networks. Especially, none takes different semantic meanings behind paths into consideration and almost all completely ignore the heterogeneity and diversity of the networks; hence, they cannot be directly applied to the similarity join problem in heterogeneous networks.To capture the heterogeneity of networks, we propose a solution for Path-based Similarity join (PS-join) in heterogeneous networks. PS-join returns the top k similar pairs of node objects from the target type based on any user spec-ified join path in a heterogeneous network. Under this framework, we study two concrete problems, i.e., exact similarity join (ePS-join) and approximate similar-ity join (aPS-join) respectively. For exact similarity join, we improve the existing PSR method which is used to answer the k closest pairs query in spacial databases, resulting in a R*-tree based algorithm (denoted as ePS-join). Although the im-provement is effective, the drawback of R*-tree limits ePS-join’s application in high dimensional spaces. Therefore, to address the high-dimensionality and mag-nanimity of heterogeneous networks, we reduce the similarity join problem to an approximate version, i.e., aPS-join. We introduce the LSH indexing technique to build up an indexing structure for an input data set and then do similarity join in nearby buckets. We call this method as NBLSH-based aPS-join (aPS-join based on LSH using Nearby Buckets). Moreover, we investigate the properties of our similarity measure and propose a novel pruning strategy to further reduce the candidates. This optimized method is referred to as BPLSH-based aPS-join (aPS-join based on LSH using Buckets Pruning).Compared with existing Link-based Similarity join (LS-join) method, aPS-join can derive various similarity semantics. Experimental results on real data sets show the efficiency and effectiveness of our approach.
Keywords/Search Tags:Similarity join, Heterogeneous network, Join path, R~*-tree, LSH
PDF Full Text Request
Related items