Font Size: a A A

Sampling-based randomization techniques for approximate query processing

Posted on:2008-05-17Degree:Ph.DType:Dissertation
University:University of FloridaCandidate:Joshi, ShantanuFull Text:PDF
GTID:1448390005950023Subject:Computer Science
Abstract/Summary:
The past couple of decades have seen a significant amount of research directed towards data warehousing and efficient processing of analytic queries. This is a daunting task due to massive sizes of data warehouses and the nature of complex, analytical queries. This is evident from standard, published benchmarking results such as TPC-H, which show that many typical queries can require several minutes to execute despite using sophisticated hardware equipment. This can seem expensive especially for ad-hoc, data exploratory analysis. One direction to speed up execution of such exploratory queries is to rely on approximate results. This approach can be especially promising if approximate answers and their error bounds are computed in a small fraction of the time required to execute the query to completion. Random samples can be used effectively to perform such an estimation. Two important problems have to be addressed before using random samples for estimation. The first problem is that retrieval of random samples from a database is generally very expensive and hence index structures are required to be designed which can permit efficient random sampling from arbitrary selection predicates. Secondly, approximate computation of arbitrary queries generally requires complex statistical machinery and reliable sampling-based estimators have to be developed for different types of analytic queries. My research addresses the two problems described above by making the following contributions: (a) A novel file organization and index structure called the ACE Tree which permits efficient random sampling from an arbitrary range query. (b) Sampling-based estimators for aggregate queries which have a correlated subquery where the inner and outer queries are related by the SQL EXISTS, NOT EXISTS, IN or NOT IN clause. (c) A stratified sampling technique for estimating the result of aggregate queries having highly selective predicates.
Keywords/Search Tags:Queries, Sampling, Random, Approximate, Query
Related items