
Query and Mining in Large Graph Databases

Posted on: 2014-10-13
Degree: Ph.D.
Type: Thesis
University: The Chinese University of Hong Kong (Hong Kong)
Candidate: Zhu, Yuanyuan
GTID: 2458390008454796
Subject: Engineering
Abstract/Summary:
Graphs are a powerful way to model complex structural relationships among data objects and are widely used in various applications. As these application domains develop, graph databases have become large and continue to grow rapidly. This poses new challenges for graph query and mining, among which we focus on three problems. The first is how to find a correspondence between the nodes of two large graphs so that substructures in one graph are mapped to similar substructures in the other. The second is how to retrieve graphs similar to a query graph from a graph database consisting of a large number of graphs. The third is how to extract subgraph features to build an automated classification model for a graph database whose graphs belong to different classes.

In this thesis, for the first problem, we propose a novel two-step approach that can efficiently match two large graphs with thousands of nodes while achieving high matching quality. In the first step, we design an anchor-selection/expansion scheme to construct a good initial matching heuristically. In the second step, we propose a new approach to refine the initial matching and give an optimality result for our refinement algorithm. The approach produces an approximate matching with high quality and efficiency.

To address the second problem, we introduce a new graph distance measure based on the maximum common subgraph (MCS) of two graphs, which captures both the common and the differing structures of the two graphs. Since computing the MCS of two graphs is NP-hard, we propose a fast algorithm for answering top-k graph similarity queries that significantly reduces the number of MCS computations. The algorithm prunes unqualified graphs using three lower bounds: the first two are derived from the structures of the two graphs, and the third follows from the triangle inequality of the distance measure. Three index schemes, with different tradeoffs between pruning power and construction cost, are designed to assist query processing.

For the third problem, we identify two main issues with the widely used discriminative score for feature selection and introduce a new diversified discriminative score that exploits the additional value of diversity together with discriminativity. We analyze the properties of the diversified discriminative score from several perspectives and demonstrate that it makes positive and negative graphs more separable. We also propose new algorithms that select features based on the new score and show that they achieve high classification accuracy.
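
The pruning idea behind the top-k similarity query can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the distance here is a cheap stand-in (the size of the symmetric difference of two edge sets) used in place of the MCS-based measure, and the names `edge_distance`, `top_k_similar`, `pivot`, and `pivot_index` are hypothetical. The sketch assumes only that the distance is a metric, so that a size-based bound and a triangle-property bound through one precomputed pivot graph are both valid lower bounds.

```python
import heapq


def edge_distance(g1, g2):
    """Stand-in metric (an assumption, not the thesis's MCS-based measure):
    the number of edges that appear in exactly one of the two edge sets."""
    return len(g1 ^ g2)


def top_k_similar(query, database, k, pivot, pivot_index):
    """Return the k nearest (distance, graph_id) pairs and the number of
    exact distance computations performed.

    pivot_index[gid] holds the precomputed distance from `pivot` to graph gid.
    """
    calls = 1
    d_query_pivot = edge_distance(query, pivot)
    heap = []  # max-heap on distance (stored negated); holds the current k best
    for gid, graph in database.items():
        # Structural lower bound: edge sets of different sizes must differ
        # by at least the size gap.
        size_bound = abs(len(query) - len(graph))
        # Triangle lower bound: |d(q, p) - d(p, g)| <= d(q, g) for any metric.
        triangle_bound = abs(d_query_pivot - pivot_index[gid])
        lower = max(size_bound, triangle_bound)
        if len(heap) == k and lower >= -heap[0][0]:
            continue  # cannot improve on the current k-th best; skip exact call
        calls += 1
        d = edge_distance(query, graph)
        if len(heap) < k:
            heapq.heappush(heap, (-d, gid))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, gid))
    return sorted((-neg, gid) for neg, gid in heap), calls


# Toy data: each "graph" is a frozenset of undirected edges.
query = frozenset({(1, 2), (2, 3), (3, 4)})
database = {
    "g1": frozenset({(1, 2), (2, 3), (3, 4)}),
    "g2": frozenset({(1, 2), (2, 3)}),
    "g3": frozenset({(5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11)}),
    "g4": frozenset({(1, 2)}),
}
pivot = database["g1"]
pivot_index = {gid: edge_distance(pivot, g) for gid, g in database.items()}

result, calls = top_k_similar(query, database, 2, pivot, pivot_index)
```

In this toy run the exact distance is computed only for graphs whose lower bound is below the current k-th best distance; the remaining graphs are discarded from the bounds alone. With an expensive distance such as one based on MCS, the saved computations are where the benefit would come from.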
Keywords/Search Tags: Graph, Large, Query, New, Score