Mining and indexing graph databases  Posted on:20140811  Degree:Ph.D  Type:Dissertation  University:The Pennsylvania State University  Candidate:Yuan, Dayu  Full Text:PDF  GTID:1458390008961272  Subject:Computer Science  Abstract/Summary:   Graphs are widely used to model structures and relationships of objects in various scientific and commercial fields. Chemical molecules, proteins, malware systemcall dependencies and threedimensional mechanical parts are all modeled as graphs. In this dissertation, we propose to mine and index those graph data to enable fast and scalable search.;There are two common search scenarios: subgraph search and supergraph search. In a subgraph search, given a query graph q, the algorithm searches for all graphs that have q as a subgraph, from a graph database. A supergraph search, on the other hand, retrieves all the database graphs that have q as a supergraph. Determining whether a graph is a subgraph (or supergraph) of another is an NPcomplete problem. Hence, it is challenging to efficiently process graph queries on large databases. Graph indices are commonly built to fast process graph queries. Subgraph patterns are mined from the graph database to build such graph indices. It has been shown that both the index structure of the graph indices and the subgraph patterns that are selected for indexing determine the time of processing graph queries. We address the graph search problem by designing innovative index structures and proposing new subgraph pattern mining algorithms.;First, we propose an index structure, Lindex, which organizes subgraph patterns in a lattice. As a novel index structure, Lindex (1) is effective in filtering false graphs, (2) provides fast index lookups, (3) is fast with respect to index construction and maintenance, and (4) can be constructed using any set of substructure index patterns. These four properties result in a fast and scalable graphquerying infrastructure. Lindex is compatible with any choice of patterns. Empirically, we demonstrate that Lindex used in conjunction with subgraph patterns proposed in previous works outperforms other specifically designed index structures.;Second, we address the pattern mining problem by modeling it as a combinatory optimization problem. Instead of mining subgraph patterns based on heuristics as in previous works, we proposed a greedy algorithm, which mines index patterns that minimize the queryprocessing time with theoretical guarantee. Specifically for the supergraph search, previous algorithms are proposed to optimize either a filtering gain or a prefixsharing gain. No single subgraphmining algorithm considers both gains, although these two gains jointly model the savings of the queryprocessing time. We are the first one to mine subgraph patterns to optimize both the filtering gain and the prefixsharing gain.;Finally, we study the update of graph indices. All previous algorithms are mineatonce algorithms in which the whole set of index patterns is mined in one run before building a graph index. Because of the change of environments, such as an update of the graph database and the increase of available memory, the index needs to be updated to accommodate such changes. Most of the mineatonce algorithms are timeconsuming because they involve frequent subgraph or subtree mining over the whole graph database. Also, constructing and deploying a new index involves expensive disk operations. Hence, it is inefficient to remine the patterns and rebuild the index from scratch. We propose an iterative mining algorithm and a onepass mining algorithm to address the indexupdate problem. Both algorithms replace an old index pattern p' with a newly selected subgraph pattern p until the decrease of the queryprocessing time caused by the swap is lower than a threshold. The iterative mining algorithm always mines the new subgraph pattern p that maximizes the decrease of the queryprocessing time. And it can achieve an approximation ratio of 1  1/e. The onepass mining algorithm, on the other hand, can choose any pattern p that meets a criterion f(p, p ') > 0, where the function f(·) measures the advantage of indexing p over p '. The onepass mining algorithm have an approximation ratio of 1/4 with faster runningtime than iterative mining..;We conduct extensive empirical studies to study the performance of our proposed algorithms. The experiment results demonstrate the effectiveness of our algorithms by showing their improvement over the stateoftheart approaches over existing benchmark datasets.  Keywords/Search Tags:  Graph, Index, Mining, Algorithms, Queryprocessing time, Search, Over   Related items 
 
