Font Size: a A A

Research On Techniques For Query Processing On Graph Databases

Posted on:2011-11-29Degree:DoctorType:Dissertation
Country:ChinaCandidate:S ZhangFull Text:PDF
GTID:1118360332456382Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Graph is a general data structure in computer science, and can be used for modelingcomplex relationships between data objects. For example, graphs can model chemicalcompound structures, protein interaction networks and social networks etc.. With therapid proliferation of graph data in various domains, graph data management has becomean important research subfield in the data management domain. Query processing ongraph databases is one of the most important and fundamental research topics in graphdata management, as a large number of graph-related tasks and many applications, suchas graph mining and the PubChem chemical database, rely on processing queries in anefficient way. In this dissertation, we focus on the techniques for processing querieson graphs databases, review the current research results, formulate several new researchproblems and propose some novel approaches for efficiently query processing. The maincontributions of the dissertation are as follows:1. We propose a novel approach for efficiently processing supergraph queries ongraph databases. The proposed new approach involves the techniques of (1) compactlyorganizing graph databases, (2) constructing index by using significant subgraph fea-tures and (3) processing queries based on the compact organization and the index. (1)For graph query processing, many existing methods use the filtering-and-verification (orF&V) methodology, i.e. first fast filter some negative results to obtain a candidate set andthen verify each candidate. For processing supergraph queries, the existing F&V-basedmethod arranges the graphs in databases one by one separately. We propose to mergesome common subgraphs shared by multiple graphs and compactly organizing graphdatabases, which derives a logical data structure, namely GPTree. In order to form a goodGPTree, an optimization problem of optimal induced subgraph selecting is formulated.After proving it is NP-hard, we propose an approximation algorithm with ratio bound 2.(2) For constructing effective indices, two algorithms for generating significant indexedsubgraph features, which are independent of historical queries, are proposed. Based onthe generated indexed subgraph features, two indices CRGraph and FGPForest are de-vised. In particular, by analyzing the effectiveness of a set of subgraphs, a significancemeasure w.r.t. subgraph is defined. An algorithm for finding all the subgraphs with no less than a user-specified significance threshold is presented. Also, another algorithm forextracting an approximate set of the significant subgraphs is proposed, for the purpose offast index creating that is required by some real-life applications. In order to upgrade theeffectiveness of index, based on the mathematical analysis, we propose a method for op-timally ordering the extracted indexed subgraphs. (3) An algorithm, called GPTreeTest,for testing subgraph isomorphisms from multiple graphs in a GPTree to a query graph isproposed. This new algorithm can utilize the information of common subgraphs recordedin a GPTree and reduce the overall computational cost for subgraph isomorphism test-ing. The experimental results on both real and synthetic datasets show the new approachoutperform the existing counterpart by one to two orders of magnitude.2. We present a new problem of probabilistic top-k subgraph matching query onuncertain graph databases, and propose a query processing method. Firstly, a uncertaingraph data model is presented. Based on the data model, the formulation of probabilis-tic top-k subgraph matching query is introduced. The neighborhood subgraph w.r.t. avertex in a graph is the subgraph consisting of all the vertices with the distance no morethan a threshold and all the edges between these vertices. By using the characteristics ofspace-locality of structures in a graph, based on neighborhood subgraphs, we propose aneffective index structure, namely NG-Index, with probabilistic information in a data un-certain graph. NG-Index can be easily implemented in relational databases. In addition,based on NG-Index, we present an efficient search-tree-based algorithm with probabilis-tic pruning technique to search large uncertain graphs. Experiment results validate theefficiency and scalability of the proposed techniques.3. We present a new problem of querying graph statistics in the presence of concepthierarchy. We formulate the user search interest by a concept graph. Given a concepthierarchy, a user-issued concept graph, a solution to the problem aims at efficiently evalu-ating the statistics of a given data graph. Firstly, we present a definition of graph statistics,which is based on the number of subgraph matchings of user-issued concept graphs. Then,we devise an index structure based on subgraph density, and propose a two-phase evalu-ation method. In the first phase, we fast break apart the original data graph into a set ofsmall-size connected graphs; in the second phase, each small graph is evaluated and thenthe results are merged together. The core sub-problem involved in the second phase iscomputing the number of subgraph matchings of concept graphs. For this sub-problem, we propose two algorithms, i.e. forward computation and backward computation. At last,the experimental results on real datasets show the elegance of the proposed method.4. We propose a new search-based method for testing subgraph isomorphisms formoderate-size labeled graphs; and after that we apply the proposed new techniques inthe GPTree data structure and propose GPTreeTest* algorithm. We analyze and utilizethe characteristics of labeled graphs, and reduce the search space by integratedly usingthe label and structure information. Firstly, labels are converted into numbers based onlabel frequency. Secondly, we record a set of fine-grained vertex invariants. Vertex invari-ants are some inherent properties of the vertices that do not change across isomorphismmappings. With the help of these fine-grained vertex invariants, label information is prop-agated through the structure in graphs, which reduces the overall search space. Thirdly,we derive pruning conditions from the fine-grained vertex invariants. These conditionsare quite strong since the label and structure information are combined together. Throughapplying the proposed new techniques in GPTree, we show the proposed new techniquesare able to improve a number of tasks. The experimental results show the high efficiencyof the proposed testing method and show that GPTreeTest* outperforms the original GP-TreeTest algorithm.
Keywords/Search Tags:Supergraph Query, Probabilistic Top-k Subgraph Matching Query, GraphStastics, Subgraph Isomorphism, Uncertain Graph
PDF Full Text Request
Related items