Correlation-aware statistical methods for sampling-based group by estimates

Posted on:2010-02-11

Degree:Ph.D

Type:Thesis

University:University of Florida

Candidate:Xu, Fei

Full Text:PDF

GTID:2448390002482240

Subject:Computer Science

Abstract/Summary:

Over the last decade, Data Warehousing and Online Analytical Processing (OLAP) have gained much interest from industry because of the need for processing analytical queries for business intelligence and decision support. A typical analytical query may require long evaluation time because analytical queries are complicated, and because the datasets used to evaluate analytical queries are large. One key problem arising from long evaluation time is that no feedback is given until the query is fully evaluated. This is problematic for several reasons. First, this makes query debugging very difficult. Second, the long running time also discourages users to explore the data interactively. One way to speed up the evaluation time is to use approximate query processing techniques, such as sampling. Researchers have developed scalable approximate query processing techniques for SELECT-PROJECT-JOIN-AGGREGATE queries. However, most work has ignored GROUP BY queries. This is a significant hole in the state-of-the-art, since the GROUP BY query is an important type of OLAP query. For example, more than two thirds of the public TPC-H benchmark queries are GROUP BY queries. Running a GROUP BY query in an approximate query processing system requires the same sample to be used to estimate the result of each group, which induces correlations among the estimates. Thus from a statistical point of view, providing estimation information for a GROUP BY query without considering the correlations is problematic and probably misleading. In this thesis, I formally address this problem and provide correlation-aware statistical methods to answer sampling-based GROUP BY queries. I make three specific contributions to the state-of-the-art in this area. First, I formally characterize the correlations among the groupwise estimates. Second, I develop methods to provide correlation-aware simultaneous confidence bounds for GROUP BY queries. Finally I develop correlation-aware statistical methods to return all "top-k" groups with high probability when only database samples are available.

Keywords/Search Tags:

GROUP, Correlation-aware statistical methods, BY queries, BY query, Approximate query processing, Analytical

Related items

1	Research On Key Technologies Of Distributed Rank-aware Query Processing
2	Research Of Approximate Query Processing Technology For Large Scale Data
3	Key Technologies Of Big Data Approximate Computing Based On Synopses
4	Sampling-based randomization techniques for approximate query processing
5	Efficient Algorithms For Approximate Aggregation And Nearest Neighbor Queries Over Multi-Dimensional Data
6	Fuzzing Methods For Query Processing Functionality Of Analytical Databases
7	Research On Query Processing Technologies Over Large Scale Knowledge Graphs
8	The Research Of Key Techniques Of Uncertain TOP-K Query Processing
9	Research On Key Methods Of Efficient Multi-dimensional Online Analytical Processing Query
10	Analytical query processing in data intensive applications