Font Size: a A A

Mining, indexing and similarity search in large graph data sets

Posted on:2007-05-23Degree:Ph.DType:Dissertation
University:University of Illinois at Urbana-ChampaignCandidate:Yan, XifengFull Text:PDF
GTID:1448390005464726Subject:Computer Science
Abstract/Summary:
Scalable analytical algorithms and tools for large graph data sets are in great demand across domains from software engineering to computational biology as it is very difficult, if not impossible, for human beings to manually analyze any reasonably large collection of graphs due to their high complexity. In this dissertation, we investigate two long standing fundamental problems: Given a graph data set, what are the hidden structural patterns and how can we find them? and how can we index graphs and perform similarity search in large graph data sets? Graph pattern mining is an expensive computational problem since subgraph isomorphism is NP-complete. Previous solutions generate inevitable overheads since they rely on joining two graphs to form larger candidates. We develop a graph canonical labeling system, gSpan, showing both theoretically and empirically that this kind of join operation is unnecessary. Graph indexing, the second problem addressed in this dissertation, may incur an exponential number of index entries if all of the substructures in a graph database are used for indexing. The solution, gIndex, proposes a novel, frequent and discriminative graph mining approach that leads to the development of a compact but effective graph index structure that is orders of magnitude smaller in size but an order of magnitude faster in performance than traditional approaches.; Besides graph mining and search, this dissertation provides thorough investigation of pattern summarization, pattern-based classification, constraint pattern mining, and graph similarity searching, which could leverage the usage of graph patterns. It also explores several critical applications in bioinformatics, computer systems and software engineering, including gene relevance network analysis for functional annotation, and program flow analysis for automated software bug isolation.; The developed concepts, theories, and systems may significantly deepen the understanding of data mining principles in structural pattern discovery, interpretation and search. The formulation of a general graph information system through this study could provide fundamental supports to graph-intensive applications in multiple domains.
Keywords/Search Tags:Graph, Mining, Search, Indexing, Similarity
Related items