Probabilistic query models for transaction data

Posted on:2003-06-10

Degree:Ph.D

Type:Dissertation

University:University of California, Irvine

Candidate:Pavlov, Dmitry Yurievich

Full Text:PDF

GTID:1468390011979276

Subject:Computer Science

Abstract/Summary:

Interactive querying of massive data sets is an increasingly important application in the fields like interactive data mining, exploratory data analysis, prediction of customer behaviour and optimization of database management systems (DBMS). Formally, the task can be formulated as follows: given a binary transaction data set and a query on a subset of its attributes, find (predict) the probability that a randomly selected data record satisfies the query. A straightforward solution of scanning the data directly does not scale up to typical transaction data sets that are huge in a data mining sense. Although fast, memory efficient and one of the most popular solutions in commercial DBMS, the independence model has long been criticized for often being quite inaccurate. Other techniques proposed in the literature suffer from the curse of dimensionality and are practical only for low-dimensional data sets.; In this dissertation, we develop a number of novel probabilistic approaches to the querying problem. Unlike previously proposed methods our techniques satisfy the following requirements: (a) they work on high-dimensional binary transaction data, and (b) they allow the user to effectively trade accuracy in the estimates for query execution time and memory taken by the model. We empirically explore the tradeoffs in terms of accuracy, time and memory offered by these models and establish that the choice of the best-performing model for any particular querying situation depends on a variety of factors, such as the density of the data and the distribution of user queries. We illustrate that no single model universally dominates.; This naturally leads to the question of how to automatically determine which approximation model is optimal for any given situation. To resolve this question, we leverage recent results in the machine learning and statistics literature that show that combinations of models can outperform any single model for prediction tasks. Specifically, for query approximation, we propose a data-adaptive technique for combining probabilistic query approximation models. We demonstrate that on real-world and simulated data sets the combined model can reduce the error of any single model by factors of up to 50%. Furthermore, we demonstrate how time and memory constraints can be easily incorporated into the model-combining algorithm. Adapting the model is a straightforward and scalable optimization problem that can be completely automated, providing a practical and query-adaptive framework for online query approximation.

Keywords/Search Tags:

Query, Data, Model, Probabilistic

Related items

1	An Improved Probabilistic Relational Model And Its Probability To Queries
2	Probabilistic query models for transaction data
3	Research On Answering Why-not Questions Over Probabilistic Reverse Top-k Queries
4	Research On The Methods Of Uncertainty Data Indexing And Querying In Mobile Environments
5	Representing And Real-time Probabilistic Query Processing To Uncertain Data With Complex Correlation
6	The Research On Query Evaluation And Top-k Query In Probabilistic Relation
7	Scalable Query Technology Over Probabilistic Databases
8	Studies On Probabilistic Nearest Neighbor Queries Over Uncertain Data
9	Systematic Representation And Query Processing Of Uncertain Data Based On Bayesian Network
10	Research On XML Data Theory Based On Probability