Font Size: a A A

Graph based methods for pattern mining

Posted on:2008-11-28Degree:Ph.DType:Thesis
University:Michigan State UniversityCandidate:Moonesinghe, H. D. KFull Text:PDF
GTID:2448390005968504Subject:Computer Science
Abstract/Summary:
Pattern mining is the automatic extraction of novel and potentially useful knowledge from large databases. It plays a major role in many applications, such as intrusion detection, business intelligence, and bio-medical applications. This thesis explores two important pattern mining tasks, namely, frequent pattern mining and anomalous pattern mining. We illustrate the limitations of existing pattern mining algorithms and develop a class of algorithms inspired by graph theoretic principles to overcome these limitations.; First, we demonstrate characteristics of existing frequent pattern mining algorithms that prohibit them from scaling-up to very large databases. We introduce a novel data structure known as a PrefixGraph to compress crucial information about the database in order to facilitate fast enumeration of frequent patterns. An efficient pattern search and pruning algorithm called PGMiner was also developed using principles derived from network flow analysis.; Second, we investigate a class of anomalies that are difficult to distinguish from normal observations, resulting in high false alarm rates in many anomaly detection algorithms. To address this problem, we introduce OutRank, a stochastic graph based algorithm for unsupervised anomaly detection. In this method, a graph is constructed using two approaches: one based on the similarity between objects and the other using shared neighbors of the object. The heart of this approach uses a Markov chain random walk model to determine the anomaly score of an object.; Recent years have also witnessed the proliferation of graph-based data, spurred by advances in application areas such as bioinformatics, chem-informatics, Web 2.0, and sensor networks. In this thesis, we systematically develop an anomaly detection framework to detect anomalies in graph-based data. First, frequent subgraph patterns are extracted from the data. We investigate two methods for utilizing the frequent subgraph patterns in graph anomaly detection. The first approach, gRank, is an extension of OutRank to graph-based data. It uses a substructure-based similarity measure to create a stochastic graph and then performs random walk to determine the anomaly score of a graph. The second approach, called gEntropy, creates features based on the frequent subgraph patterns and builds a probability model for the graphs using the maximum entropy principle. Anomalous graphs are detected based on the probability that such a graph is generated by the model. Finally, we extend the maximum entropy approach to supervised anomaly detection and show that it produces significant improvements over the unsupervised case.; Graph based methods have been applied in several areas such as machine learning and networking. In this thesis, we show how graph based methods can be applied successfully in pattern mining area. More specifically, we show how graph theoretic approaches can be utilized to improve the efficiency and the effectiveness of frequent item- set mining and anomaly detection. With the advances in technology, a huge amount of data is accumulated everyday and database sizes have grown to Terabyte and Petabyte scale. We believe the algorithms developed in this thesis are a step forward towards making pattern mining more effective for such large and complex data sets.
Keywords/Search Tags:Pattern mining, Graph, Data, Anomaly detection, Large, Thesis
Related items