Font Size: a A A

Probabilistic query models for transaction data

Posted on:2003-06-10Degree:Ph.DType:Dissertation
University:University of California, IrvineCandidate:Pavlov, Dmitry YurievichFull Text:PDF
GTID:1468390011979276Subject:Computer Science
Abstract/Summary:
Interactive querying of massive data sets is an increasingly important application in the fields like interactive data mining, exploratory data analysis, prediction of customer behaviour and optimization of database management systems (DBMS). Formally, the task can be formulated as follows: given a binary transaction data set and a query on a subset of its attributes, find (predict) the probability that a randomly selected data record satisfies the query. A straightforward solution of scanning the data directly does not scale up to typical transaction data sets that are huge in a data mining sense. Although fast, memory efficient and one of the most popular solutions in commercial DBMS, the independence model has long been criticized for often being quite inaccurate. Other techniques proposed in the literature suffer from the curse of dimensionality and are practical only for low-dimensional data sets.; In this dissertation, we develop a number of novel probabilistic approaches to the querying problem. Unlike previously proposed methods our techniques satisfy the following requirements: (a) they work on high-dimensional binary transaction data, and (b) they allow the user to effectively trade accuracy in the estimates for query execution time and memory taken by the model. We empirically explore the tradeoffs in terms of accuracy, time and memory offered by these models and establish that the choice of the best-performing model for any particular querying situation depends on a variety of factors, such as the density of the data and the distribution of user queries. We illustrate that no single model universally dominates.; This naturally leads to the question of how to automatically determine which approximation model is optimal for any given situation. To resolve this question, we leverage recent results in the machine learning and statistics literature that show that combinations of models can outperform any single model for prediction tasks. Specifically, for query approximation, we propose a data-adaptive technique for combining probabilistic query approximation models. We demonstrate that on real-world and simulated data sets the combined model can reduce the error of any single model by factors of up to 50%. Furthermore, we demonstrate how time and memory constraints can be easily incorporated into the model-combining algorithm. Adapting the model is a straightforward and scalable optimization problem that can be completely automated, providing a practical and query-adaptive framework for online query approximation.
Keywords/Search Tags:Query, Data, Model, Probabilistic
Related items