Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data

Posted on:2009-10-11

Degree:Ph.D

Type:Dissertation

University:The Ohio State University

Candidate:Wang, Chao

GTID:1448390005458718

Subject:Computer Science

Abstract/Summary:

This work develops a probabilistic framework for modeling, querying, and analyzing large-scale structured and semi-structured data. The framework has three components: (1) mining non-redundant local patterns from data; (2) gluing these local patterns together with probabilistic models (e.g., Markov random fields (MRFs), Bayesian networks); and (3) reasoning (making inferences) over the data to solve various data analysis tasks. In more detail, our contributions are as follows.

Mining non-redundant frequent itemset patterns on large transactional data. In many real-world problems, frequent pattern mining algorithms yield so many frequent patterns that the end user is swamped when interpreting the results. We present an approach that employs probabilistic models to identify non-redundant itemset patterns in a large collection of frequent itemsets on transactional data. We show that our approach effectively eliminates a large amount of redundancy from such a collection.

Employing local probabilistic models to glue non-redundant itemset patterns on large transactional or network data. We propose a technique that employs local probabilistic models to glue non-redundant itemset patterns together for the link prediction task in co-authorship network analysis. The technique combines topology analysis on network structure data with frequency analysis on network event log data. The main idea is to consider the co-occurrence probability of the two end nodes associated with a candidate link. We propose a method of building MRFs over local data regions to compute this co-occurrence probability. Experimental results demonstrate that the co-occurrence probability inferred from the local probabilistic models is very useful for link prediction.

Employing global probabilistic models to glue non-redundant itemset patterns on large transactional data. We explore employing global models, that is, models over large data regions, to glue non-redundant itemset patterns together. To this end, we investigate learning approximate global MRFs on large transactional data and propose a divide-and-conquer style modeling approach. An empirical study shows that the models are effective in modeling the data and in approximately answering queries on the data.

Mining non-redundant tree patterns and employing probabilistic approaches to glue them on large XML data. We propose a technique for identifying non-redundant tree patterns in a large collection of structural tree patterns, and show that it effectively eliminates redundancy from such a collection. Furthermore, we present techniques that employ these non-redundant tree patterns as summary statistics for the XML data to solve the XML twig selectivity estimation problem. We propose a probabilistic framework under which the selectivity of a twig query can be estimated from information about its subtrees. Empirical results demonstrate the efficacy of our approach on real and synthetic datasets.
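As a rough illustration of the first contribution, the sketch below treats an itemset as redundant when a probabilistic model already predicts its support closely. It is not the dissertation's method: a naive independence model stands in for the MRF described above, and the function names, the toy transactions, and the `tolerance` threshold are all assumptions made for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def independence_estimate(transactions, itemset):
    """Predict support as the product of single-item supports.

    Assumption: this independence model is a stand-in for the
    richer probabilistic model (an MRF) named in the abstract.
    """
    estimate = 1.0
    for item in itemset:
        estimate *= support(transactions, [item])
    return estimate


def non_redundant(transactions, itemsets, tolerance=0.1):
    """Keep itemsets whose observed support the model fails to predict."""
    kept = []
    for itemset in itemsets:
        observed = support(transactions, itemset)
        predicted = independence_estimate(transactions, itemset)
        if abs(observed - predicted) > tolerance:
            kept.append(frozenset(itemset))
    return kept


# Hypothetical toy data: "a" and "b" always co-occur, while "c" and "d"
# co-occur about as often as independence would predict.
transactions = [frozenset(t) for t in (
    {"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a", "b", "d"},
    {"c"}, {"c", "d"}, {"d"}, {"c"},
)]
candidates = [("a", "b"), ("c", "d")]
kept = non_redundant(transactions, candidates, tolerance=0.1)
```

Here `kept` contains only the itemset {a, b}: its observed support (0.5) is far from the independence prediction (0.25), so it carries information the model lacks, whereas {c, d} (observed 0.125, predicted 0.1875) falls within the tolerance and is treated as redundant.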

Keywords/Search Tags:

Data, Probabilistic, Patterns, Non-redundant, Large, Local, Approach, Employing

Related items

1	A twist decomposition approach to solve functionally-redundant serial manipulators
2	Learning data driven representations from large collections of multidimensional patterns with minimal supervision
3	Biodiversity in a rapidly changing world: From local interactions to large scale patterns
4	Rapid three-dimensional tracing of the mouse brain neurovasculature with local maximum intensity projection and moving windows
5	Finding spatio-temporal patterns in large sensor datasets
6	Managing large-scale probabilistic databases
7	The Research On Medical Data Parallel Clustering Algorithm Employing MapReduce
8	An Improved Probabilistic Database Model And Its Probabilistic Nearest Neighbors Query Research
9	Deriving activity patterns from individual travel diary data: A spatiotemporal data mining approach
10	A probabilistic approach to data integration in biomedical research: The IsBIG experiments