Font Size: a A A

Research On Optimization Techniques For Fervasive Structural Similarity Computation On Large Scale Networks

Posted on:2013-11-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:W R YuFull Text:PDF
GTID:1228330395481289Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
Similarity assessment between objects is one of the important research topics in the field of data mining. Over the past decade, motivated by the well-known Google PageRank, the study of pervasive hyperlink-based similarity models has become increasingly popularized throughout the world. Among the hyperlink-based similarity models, SimRank, SimFusion and Penetrating-Rank (hereinafter referred to as P-Rank) have emerged as the three promising ones which have received a growing interest in both domestic and international disciplines. Nowadays, the intractable scale of the Internet has led to an increasing need for optimization techniques to efficiently handle similarity computations on large networks, due to their computational difficulties. However, the state-of-the-art work on SimRank, SimFusion and P-Rank often lacks the optimization approaches that encompass (i) the reduction of computational time and space,(ii) accuracy and stability analysis, and (iii) scalability. This dissertation aims to narrow these gaps. The study of the optimization approaches to these structural similarities is the cornerstone of the hyperlink-based analysis, with a plethora of real applications, including web page ranking, collaborative filtering, web search engine optimization, graph clustering and so forth.In this dissertation, the optimization issues of computing the three link-based similarity models, i.e., SimRank, SimFusion, P-Rank are well addressed, with the aim to (i) guarantee the accuracy for approximation techniques,(ii) speed up the convergence rate of the conventional iterations,(iii) reduce the computational time and memory space,(iv) parallelize the existing algorithms, and (iv) support incremental updates. More specifically, the study will mainly focus on the key techniques for optimizing similarity computations via matrix transformation based on the self-referentiality of the three similarity models, thus improving the algorithmic performance and taming the computational complexities. Moreover, we will also improve the existing similarity models, and devise the novel algorithms for structural similarity computation, with accuracy guarantee, high stability, as well as low CPU time and space complexity. Our main contributions are summarized below.1) We propose optimization approaches to computing SimRank similarity. By rewriting the SimRank equation into a matrix form, a novel iterative paradigm is devised for SimRank computation, which can exponentially accelerate the convergence rate of SimRank solution. By building a Krylov subspace, the computation of the original SimRank equation (n×n dimension) can be projected onto this low-order subspace (r×r dimension), hence significantly reducing the computational time from O(Knm) to O(rm+r2n+Kr3), and the memory space from O(n2) to O(rn), where n and m are the total numbers of vertices and edges in a graph, respectively, r is the rank of the graph adjacency matrix, and K is the total iteration number. To further improve the time complexity of SimRank computation, optimization techniques for SimRank on undirected graphs are also suggested to decompose the eigenvectors of the adjacency matrix for further reducing the running time to O(rm+Kr2). A parallel algorithm is also proposed for SimRank on undirected graphs.2) We improve the SimFusion model and optimize the algorithms of similarity computation. To deal with the two existing problems of the naive SimFusion model, i.e., divergent solutions in homogeneous networks and trivial solutions in heterogeneous networks, an improved version of SimFusion model, refereed to as SimFusion+, is developed to ensure the existence and uniqueness of SimFusion+solution by introducing the unified adjacency matrix (UAM) and by delaying its row-normalization. To address the optimization issues of SimFusion+, the dominant eigenvector of UAM is utilized to characterize the SimFusion+solution, thus greatly improving the computational time from O(Kn3) to O(n2), and the memory space from O(n2) to O(n). Furthermore, SimFusion+approximation and parallel algorithms are also exploited for computing similarities over large scale graphs and evolving graphs, respectively, which address the SimFusion+scalability problem.3) We optimize P-Rank similarity computation. Firstly, P-Rank similarity is characterized as two matrix forms, i.e., the matrix inversion form and the power series form. Based on its matrix inversion form, a singular value decomposition of the graph transition matrix is deployed for deterministically computing P-Rank with accuracy guarantees, which reduces the computational time from O(Kn4) to O(v4n2+v2n), where v(<<n) is a speed-accuracy trade-off. Based on the P-Rank power series form, a scalable probabilistic P-Rank algorithm via a Monte Carlo sampling approach is proposed to further reduce the time complexity to O(Nn), with N being the sample size. Besides, the notion of a P-Rank conditioning number is introduced, and a strigent upper bound is obtained for this number to anlaze P-Rank stability. For undirected graphs, a non-iterative P-Rank algorithm is also proposed to further reduce the time to O(vn2) via matrix diagonalization.The empirical evaluations on both synthetic and real-life datasets demonstrate that the optimization techniques on SimRank, SimFusion and P-Rank can achieve better time and space complexity with the desired accuracy, high stability and scalability; their computational efficiency outperforms the baselines by orders of magnitude on large graphs. Moreover, the parallel and incremental algorithms can further improve the computational efficiency on multi-processors and evolving graphs, respectively. In summary, our technical contributions may provide better solutions and theoretical analysis for efficiently computing structural similarities, which can be applicable to many domains such as web search engine optimization.
Keywords/Search Tags:link-based similarity, SimRank, SimFusion, Penetrating-Rank, large graphs, algorithm optimization, time and space complexity, information retrieval
PDF Full Text Request
Related items