
Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data

Posted on: 2009-10-11
Degree: Ph.D
Type: Dissertation
University: The Ohio State University
Candidate: Wang, Chao
Full Text: PDF
GTID: 1448390005458718
Subject: Computer Science
Abstract/Summary:
This work seeks to develop a probabilistic framework for modeling, querying and analyzing large-scale structured and semi-structured data. The framework has three components: (1) mining non-redundant local patterns from data; (2) gluing these local patterns together by employing probabilistic models (e.g., Markov random fields (MRFs), Bayesian networks); and (3) reasoning (making inferences) over the data to solve various data analysis tasks. In more detail, our contributions are as follows.

Mining non-redundant frequent itemset patterns on large transactional data. In many real-world problems, frequent pattern mining algorithms yield so many frequent patterns that the end user is swamped when it comes to interpreting the results. We present an approach that employs probabilistic models to identify non-redundant itemset patterns from a large collection of frequent itemsets on transactional data. We show that our approach can effectively eliminate a large amount of redundancy from a large collection of itemset patterns.

Employing local probabilistic models to glue non-redundant itemset patterns on large transactional or network data. We propose a technique that employs local probabilistic models to glue non-redundant itemset patterns together in tackling the link prediction task in co-authorship network analysis. The new technique effectively combines topology analysis on network structure data and frequency analysis on network event log data. The main idea is to consider the co-occurrence probability of the two end nodes associated with a candidate link. We propose a method of building MRFs over local data regions to compute this co-occurrence probability. Experimental results demonstrate that the co-occurrence probability inferred from the local probabilistic models is very useful for link prediction.

Employing global probabilistic models to glue non-redundant itemset patterns on large transactional data. We explore employing global models, that is, models over large data regions, to glue non-redundant itemset patterns together. To this end, we investigate learning approximate global MRFs on large transactional data and propose a divide-and-conquer style modeling approach. An empirical study shows that the models are effective in modeling the data and in approximately answering queries on the data.

Mining non-redundant tree patterns and employing probabilistic approaches to glue them on large XML data. We propose a technique for identifying non-redundant tree patterns from a large collection of structural tree patterns. We show that our approach can effectively eliminate redundancies from a large collection of structural tree patterns. Furthermore, we present techniques that employ these non-redundant tree patterns as summary statistics for the XML data to solve the XML twig selectivity estimation problem. We propose a probabilistic framework under which the selectivity of a twig query can be estimated from the information of its subtrees. Empirical results demonstrate the efficacy of our approach on real and synthetic datasets.
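To make the idea of estimating a twig's selectivity from its subtrees concrete, here is a minimal sketch. It is not taken from the dissertation: it assumes a generic conditional-independence estimator in which a twig is split into two subtrees that overlap in a shared part, and the function name, parameter names, and counts are all hypothetical.

```python
def estimate_twig_count(count_left, count_right, count_shared):
    """Estimate the match count of a twig from two overlapping subtrees.

    Assumption (illustrative, not the dissertation's stated method):
    given the shared part, matches of the two subtrees are independent,
    so count(twig) ~= count(left) * count(right) / count(shared).
    """
    if count_shared == 0:
        # No match of the shared part means no match of the full twig.
        return 0.0
    return count_left * count_right / count_shared


# Hypothetical summary statistics for the twig a[b][c]:
# the subtrees a/b and a/c overlap in the single node a.
estimate = estimate_twig_count(count_left=40, count_right=30, count_shared=20)
print(estimate)
```

The estimate is exact only when the independence assumption holds. Correlated subtrees would need larger stored patterns as summary statistics, which is where a compact set of non-redundant patterns becomes useful.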
Keywords/Search Tags:Data, Probabilistic, Patterns, Non-redundant, Large, Local, Approach, Employing