Font Size: a A A

Correlation-aware statistical methods for sampling-based group by estimates

Posted on:2010-02-11Degree:Ph.DType:Thesis
University:University of FloridaCandidate:Xu, FeiFull Text:PDF
GTID:2448390002482240Subject:Computer Science
Abstract/Summary:
Over the last decade, Data Warehousing and Online Analytical Processing (OLAP) have gained much interest from industry because of the need for processing analytical queries for business intelligence and decision support. A typical analytical query may require long evaluation time because analytical queries are complicated, and because the datasets used to evaluate analytical queries are large. One key problem arising from long evaluation time is that no feedback is given until the query is fully evaluated. This is problematic for several reasons. First, this makes query debugging very difficult. Second, the long running time also discourages users to explore the data interactively. One way to speed up the evaluation time is to use approximate query processing techniques, such as sampling. Researchers have developed scalable approximate query processing techniques for SELECT-PROJECT-JOIN-AGGREGATE queries. However, most work has ignored GROUP BY queries. This is a significant hole in the state-of-the-art, since the GROUP BY query is an important type of OLAP query. For example, more than two thirds of the public TPC-H benchmark queries are GROUP BY queries. Running a GROUP BY query in an approximate query processing system requires the same sample to be used to estimate the result of each group, which induces correlations among the estimates. Thus from a statistical point of view, providing estimation information for a GROUP BY query without considering the correlations is problematic and probably misleading. In this thesis, I formally address this problem and provide correlation-aware statistical methods to answer sampling-based GROUP BY queries. I make three specific contributions to the state-of-the-art in this area. First, I formally characterize the correlations among the groupwise estimates. Second, I develop methods to provide correlation-aware simultaneous confidence bounds for GROUP BY queries. Finally I develop correlation-aware statistical methods to return all "top-k" groups with high probability when only database samples are available.
Keywords/Search Tags:GROUP, Correlation-aware statistical methods, BY queries, BY query, Approximate query processing, Analytical
Related items