Font Size: a A A

Statistics on query expressions in relational database management systems

Posted on:2004-07-08Degree:Ph.DType:Thesis
University:Columbia UniversityCandidate:Bruno, NicolasFull Text:PDF
GTID:2468390011471480Subject:Computer Science
Abstract/Summary:
The query optimizer is the component in a relational database system that identifies efficient execution plans for input queries. Modern optimizers generally explore many alternative query plans in a cost-based manner. Specifically, the resource consumption and associated cost of each candidate plan is estimated, and the plan with the least expected cost is chosen for execution. The cost estimation for a plan depends on several factors, including resource availability during execution, the specific operators that compose the plan, and the size of intermediate results that would be generated during the plan execution. Among these factors, the intermediate-result size (or cardinality) estimation is the main source of inaccuracies during optimization: cardinality estimation typically relies on several simplifying assumptions that often do not hold in practice. Optimizers then sometimes base their decisions on inaccurate information and produce low-quality execution plans. To address this limitation, in this thesis we introduce the concept of SITS, which are statistics built on query expressions. SITS directly and accurately model intermediate results in a query execution plan, and therefore avoid error-prone simplifying assumptions during cardinality estimation. If optimizers have appropriate SITS available during optimization, the resulting query plans can be dramatically better than otherwise. Although SITs are a fairly simple concept, challenging problems need to be addressed before SITs can be seamlessly integrated into modern relational database systems. In this thesis we study three important challenges associated with SITS. First, we show how to modify query optimizers to exploit the additional statistical information provided by SITs without significantly increasing optimization time. Second, we study a spectrum of alternatives to create SITs, which balance efficiency of construction and accuracy of the resulting estimators. Third, we present techniques to recommend a small but highly beneficial set of SITs to materialize in a database system for a given query workload. In summary, we address the main obstacles for enabling SITs for optimization, namely which SITs to build, how to build them, and how to exploit them during optimization. Overall, SITs constitute a well-founded approach for dealing with complex data correlations, and positively impact the efficiency of relational database systems.
Keywords/Search Tags:Relational database, SITS, Query, Plan, Execution
Related items