Sampling-based randomization techniques for approximate query processing

Posted on:2008-05-17

Degree:Ph.D

Type:Dissertation

University:University of Florida

Candidate:Joshi, Shantanu

Full Text:PDF

GTID:1448390005950023

Subject:Computer Science

Abstract/Summary:

PDF Full Text Request

The past couple of decades have seen a significant amount of research directed towards data warehousing and efficient processing of analytic queries. This is a daunting task due to massive sizes of data warehouses and the nature of complex, analytical queries. This is evident from standard, published benchmarking results such as TPC-H, which show that many typical queries can require several minutes to execute despite using sophisticated hardware equipment. This can seem expensive especially for ad-hoc, data exploratory analysis. One direction to speed up execution of such exploratory queries is to rely on approximate results. This approach can be especially promising if approximate answers and their error bounds are computed in a small fraction of the time required to execute the query to completion. Random samples can be used effectively to perform such an estimation. Two important problems have to be addressed before using random samples for estimation. The first problem is that retrieval of random samples from a database is generally very expensive and hence index structures are required to be designed which can permit efficient random sampling from arbitrary selection predicates. Secondly, approximate computation of arbitrary queries generally requires complex statistical machinery and reliable sampling-based estimators have to be developed for different types of analytic queries. My research addresses the two problems described above by making the following contributions: (a) A novel file organization and index structure called the ACE Tree which permits efficient random sampling from an arbitrary range query. (b) Sampling-based estimators for aggregate queries which have a correlated subquery where the inner and outer queries are related by the SQL EXISTS, NOT EXISTS, IN or NOT IN clause. (c) A stratified sampling technique for estimating the result of aggregate queries having highly selective predicates.

Keywords/Search Tags:

Queries, Sampling, Random, Approximate, Query

PDF Full Text Request

Related items

1	The Research Of Approximate Computing For Big Data Based On Sampling And Coreset
2	Key Technologies Of Big Data Approximate Computing Based On Synopses
3	Research On (Îµ, Î´)-Approximate Query Processing Algorithms In Sensor Networks
4	Research And Implementation Of Sampling-Based Aggregation Query System On Big Data
5	Efficient Algorithms For Approximate Aggregation And Nearest Neighbor Queries Over Multi-Dimensional Data
6	Correlation-aware statistical methods for sampling-based group by estimates
7	Approximate answering of aggregate queries in relational databases
8	Research On Sampling Based Aggregate Query Method Of Power Quality Data
9	Approximate Query Processing Technology Based On Distribution Perception
10	Research On Approximate Query Processing Technology Based On Multidimensional Analysis Of Big Data