Font Size: a A A

Research On Techniques For Mining Graph Patterns

Posted on:2011-08-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y LiuFull Text:PDF
GTID:1118360332457943Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Graph is a general structure for modeling complex relationships between data objects.For example, graphs can model chemical compound structures, protein interactionnetworks, social networks and web graphs etc. With the increasing popularity of graphdata in the scientific and engineering fields, discovering useful knowledge from graphdatabases has become one of the important research topic in the data mining domain.Graph pattern mining is the important research branch in graph mining, since many applicationsincluding graph query, classification and clustering rely on graph patterns tomanage, query and analyze graph data. In this dissertation, we focus on the techniquesfor mining graph patterns, review the current research results and propose some new researchproblems and approaches for mining graph patterns. The main contributions of thedissertation are as follows:(1) We propose a new problem of mining representative patterns from graphdatabases. Existing frequent subgraph mining algorithms often generate explosive graphpatterns, which has deterred their practical applications to a large extent. Mining representativepatterns not only can dramatically reduce the number of output graph patterns butalso can capture interesting graph patterns. We give a formal definition of mining representativepatterns, and prove that it is NP-hard. We propose several novel concepts such asδ-cover graph, jump value andδ-jump pattern. We study the property ofδ-jump patternsand find that allδ-jump patterns must be representative patterns. Based on this propertyofδ-jump patterns, we propose three efficient algorithms, RP-FP, RP-GD and RP-Leap.RP-FP and RP-GD can mine a whole set of representative patterns, whereas RP-Leap canmine an approximate set of representative patterns. RP-FP computes representative patternsfrom closed frequent graph patterns, and can provide a tight approximation bound.However, when the number of closed frequent graph patterns is large, RP-FP is not efficient.RP-GD adopts the idea of online algorithm, and mines representative patternsdirectly from graph databases. Algorithm complexity analysis shows that RP-GD canachieve a substantial performance gain over RP-FP. RP-Leap makes use of the similaritybetween many branches in the graph pattern space, skips those branches that generate fewrepresentative patterns, and mines a approximate set of representative patterns. Extensive experiments show the following results. (1) RP-FP, RP-GD and RP-Leap can obtain aquite small set of interesting representative patterns. (2) RP-GD can mine representativepatterns more efficiently than RP-FP. Furthermore, RP-GD is very close to RP-FP in termsof results quality. (3) Compared with RP-GD, RP-Leap achieves an order of magnitudespeedup at the cost of losing a few representative patterns.(2) We propose a new problem of mining kernel substructures from graph databases.Kernel substructures often exist in real graph data. For example, functional groups in aclass of chemical compounds belong to kernel substructures. Based on the characteristicsof kernel substructures, we give a formal definition of kernel substructures, calledΔ-jump pattern. We find some interesting properties ofΔ-jump patterns. They are stable,i.e., they exhibit a certain robustness against noises and changes in the database. With theincrease of the value ofΔ,Δ-jump patterns possess stronger stability. However, miningΔ-jump patterns is challenging due to the absence of anti-monotone property for them.By exploring the underlying properties associated withΔ-jump patterns, we propose twonovel pruning techniques, internal-extension-based pruning and external-extension-basedpruning. We devise an efficient algorithm GraphJP based on two novel pruning techniques.We theoretically prove the correctness of the pruning techniques and the algorithmGraphJP. Experimental results demonstrate that the novel pruning techniques are effectivein pruning the unpromising parts of search space, GraphJP is efficient and scalablein mining frequent jump patterns, and the mining results contain kernel substructures ingraph databases.(3) We propose a new problem of mining top-k patterns from graph databases basedon some joint significance measure. The traditional top-k mining doesn't consider thecorrelation between graph patterns, and thus top-k patterns are similar to each other withrespect to structure. If users obtain one of them, the other patterns become uninterestingfor users. Because the joint significance measure is defined on the whole set of patternsinstead of a single pattern, mining top-k patterns based on some joint significance measureexcludes the correlation between graph patterns implicitly, and thus can obtain a diverseset of significant patterns. We first discuss the joint significance measure that can be appliedto a set of patterns. By exploiting the concepts of information theory, we give twoproblem formulations, MES and MIGS, and prove that they are NP-hard. Two efficientalgorithms, Greedy-TopK and Cluster-TopK, are proposed for this new problem. Greedy- TopK first generates frequent subgraphs, and then incrementally and greedily selects Kgraph patterns from frequent subgraphs. When a given significance measure satisfies thesubmodular property, Greedy-TopK can provide a tight approximation bound. To enhancethe efficiency of Greedy-TopK, we design some pruning techniques for the joint measuresin MES and MIG that can be integrated the framework of frequent subgraph mining toprune the search space. However, Greedy-TopK are not efficient and not scalable whenthe number of frequent subgraphs is large. To overcome the shortcoming of Greedy-TopK,Cluster-TopK first mines a set of representative patterns for frequent subgraphs, and thenincrementally and greedily selects K graph patterns from representative patterns. The advantageof Cluster-TopK is that it can effectively obtain a set of representative patternsinstead of mining all frequent subgraphs. Furthermore, we theoretically show that thesolution obtained by Cluster-TopK is very close to that obtained by Greedy-TopK. Experimentalresults demonstrate that the Top-K mining proposed in this dissertation is superiorto the traditional Top-K mining in terms of results usefulness. Cluster-TopK can achieveone or two orders of magnitude speedup than Greedy-TopK, while achieving comparablemining results in terms of quality and usefulness.(4) We study the problem of graph classification, and propose a graph classificationapproach CEP based on emerging patterns. CEP contains three steps. In the first step, CEPmines closed frequent graph patterns, and use them as candidate classification features. Inthe two step, CEP obtains the emerging patterns from closed frequent graph patterns. Inthis step, we need to compute the supports of graph patterns in the databases with differentclasses, which involves substantial subgraph testing. To improve the efficiency of CEP, weorganize all closed frequent graph patterns into a tree-like structure T. For each graph Gin the database, we traverse this tree-like structure T in the depth-first manner. During thetraversal, we perform the pruning by the anti-monotone property of support. Specifically,if a graph G doesn't contain some node P in T, G cannot contain any descendant of P.In this manner, we can dramatically reduce the number of subgraph testing in this step.In the third step, we construct the classification rules based on the emerging patterns. Forconstructing effective classification rules, we propose a new feature selection approachthat both considers the coverage of emerging patterns for graph data and considers thedistribution of emerging patterns in graph data. Experimental results show that CEP canachieve excellent classification performance. Furthermore, classification rules generated by CEP can be easily understood and exploited by domain experts.
Keywords/Search Tags:graph mining, frequent subgraph, representative pattern, jump pattern, emerging pattern, Top-K mining
PDF Full Text Request
Related items