Font Size: a A A

Mining interesting subgraphs by output space sampling

Posted on:2010-03-02Degree:Ph.DType:Thesis
University:Rensselaer Polytechnic InstituteCandidate:Hasan, Mohammad AlFull Text:PDF
GTID:2448390002479562Subject:Computer Science
Abstract/Summary:
Lack of scalability of the mining process and the enormous size of the output set are two significant bottlenecks of Frequent Subgraph Mining (FSM). The first restricts the applicability of FSM to large datasets. The second makes it difficult for the user to analyze the frequent patterns for subsequent usage in typical knowledge discovery tasks, such as classification, clustering, outlier detection, etc.;In this thesis, I propose output space sampling (OSS) to alleviate the above two problems. In this paradigm, the objective is to sample frequent patterns instead of complete enumeration. The sampling process automatically performs the interestingness based selection by embedding the interestingness score of the patterns in the desired target distribution. This obviates a two-step mechanism since the sampling automatically prefers the patters that are interesting. Another important point to note is that OSS is a generic method that applies for any kind of patterns such as a set, a sequence, a tree and of-course a graph.;OSS is based on Markov Chain Monte Carlo (MCMC) sampling. It performs a random walk on the candidate subgraph partial order and returns subgraph samples when the walk converges to a desired stationary distribution. The transition probability matrix of the random walk is computed locally to avoid a complete enumeration of the candidate frequent patterns, which makes the sampling paradigm scalable to large real life graph datasets. Output space sampling is an entire paradigm shift in frequent pattern mining (FPM) that holds enormous promise. While traditional FPM strives for completeness, OSS targets to obtain a few interesting samples. The definition of interestingness can be very generic, so user can sample patterns from different target distributions by choosing different interestingness functions. This is very beneficial as mined patterns are subject to subsequent use in various knowledge discovery tasks.;Output space sampling is a general idea that has various applications. In this thesis, we utilize this general idea to solve specific problems. We consider two different problems: (1) frequent pattern summarization (2) sampling discriminatory patterns. For both the above problems, we assume that the direct mining task is infeasible, so that the user wants to adopt sampling to find few interesting patterns in a reasonable amount of time. We also use the concept of OSS to find representative patterns that are very different from each other. OSS naturally supports this requirement as it obtains random samples which are very different from each other. In this thesis, two different algorithms are proposed for representative pattern mining, which are introduced in the next two paragraphs.;The first algorithm for the representative pattern mining that is proposed in this thesis is called MUSK. It is based on a uniform sampling of the output space. It obtains representative patterns by sampling uniformly from the pool of all frequent maximal patterns; uniformity is achieved by a variant of Markov Chain Monte Carlo (MCMC) algorithm. MUSK follows the concept of OSS by sampling from a target distribution where the maximal patterns have uniform value for the interestingness score.;The second algorithm that we propose for this task is ORIGAMI . It defines the representative pattern-set (R) in a novel manner that attempts to reduce structural similarities among patterns in R while extending the coverage of frequent pattern space as much as possible. Intuitively, two patterns are alpha-orthogonal if their similarity is bounded above by alpha. Each alpha-orthogonal pattern is also a representative for those patterns that are at least beta similar to it. Given user defined alpha, beta ∈ [0, 1], the goal of O RIGAMI is to mine an a-orthogonal, beta-representative set that minimizes the set of unrepresented patterns. Similar to OSS paradigm, O RIGAMI uses a randomized algorithm to randomly traverse the pattern space, seeking previously unexplored regions, to return a set of maximal patterns. But, uncharacteristic to OSS, ORIGAMI employs a second-step to extract an alpha-orthogonal, beta-representative set from the mined maximal patterns using a local optimal algorithm. The second step is essential to provide the alpha-orthogonal, beta-representative guarantee. (Abstract shortened by UMI.)...
Keywords/Search Tags:Mining, Sampling, Output, Patterns, OSS, Interesting, Representative, Algorithm
Related items