Research on PostgreSQL Statistics Estimation Based on Block-Level Sampling

Posted on: 2008-10-25    Degree: Master    Type: Thesis
Country: China    Candidate: J Chen    Full Text: PDF
GTID: 2198330332481736    Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer hardware and software and the widespread use of computer systems across industries, data has become a valuable resource for institutions of every kind. Database systems are now essential to research departments, government bodies, and enterprises. As data volumes keep growing, one of the most important requirements for a DBMS is to keep the performance of the systems built on it at an acceptable level. The query optimizer is the primary mechanism for meeting this requirement.
There are two kinds of query optimizer: rule-based and cost-based. Most current commercial DBMSs use a cost-based optimizer. Its advantage over a rule-based one is that it estimates the cost of a query from information about the database objects involved, so the DBMS can choose the lowest-cost execution path. That information is the statistics gathered by the DBMS. The precision of the statistics therefore strongly affects the accuracy of query cost estimation and, in turn, the quality of the optimizer. A DBMS can gather statistics either by computing them over the full data or by estimating them from a sample. Computing gives high precision but at high cost, and for large objects it can noticeably increase system load; estimating gives lower precision but costs less and adds less load, even for large objects.
This thesis is based on the open-source DBMS PostgreSQL. It analyzes the source code of the query optimizer and of statistics gathering, focuses on histograms and distinct-value counts, which are the statistics most important to query cost estimation, and develops a way of gathering these statistics in the DBMS that combines computing and estimating.
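To make the link between histograms and cost estimation concrete, here is a minimal Python sketch of how an optimizer can turn equi-depth histogram boundaries into a selectivity estimate for a predicate such as `col < x`. It is an illustration only, not the thesis's code or PostgreSQL's implementation; the function names `equi_depth_bounds` and `selectivity_less_than` are hypothetical, and the sketch assumes numeric values spread uniformly inside each bucket.

```python
from bisect import bisect_right


def equi_depth_bounds(values, num_buckets):
    """Return num_buckets + 1 boundaries so each bucket holds about the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    # Take evenly spaced positions in the sorted data as bucket boundaries.
    return [ordered[(i * (n - 1)) // num_buckets] for i in range(num_buckets + 1)]


def selectivity_less_than(bounds, x):
    """Estimate the fraction of rows with value < x from equi-depth boundaries."""
    num_buckets = len(bounds) - 1
    if x <= bounds[0]:
        return 0.0
    if x >= bounds[-1]:
        return 1.0
    # Bucket b satisfies bounds[b] <= x < bounds[b + 1].
    b = bisect_right(bounds, x) - 1
    lo, hi = bounds[b], bounds[b + 1]
    # Assumption: values are spread uniformly inside a bucket.
    within = (x - lo) / (hi - lo)
    # Every bucket before b counts fully; bucket b counts partially.
    return (b + within) / num_buckets


if __name__ == "__main__":
    bounds = equi_depth_bounds(range(1, 101), 4)
    print(bounds)                             # [1, 25, 50, 75, 100]
    print(selectivity_less_than(bounds, 50))  # 0.5
```

The estimate of 0.5 is close to the true fraction of 0.49 for this data. An optimizer would multiply such a fraction by the table's row count to predict how many rows a scan returns, which is why inaccurate boundaries lead directly to inaccurate cost estimates.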
For sampling-based estimation, the thesis uses block-level sampling together with a histogram-based cross-validation algorithm that reduces the bias of the sample data to below a preset threshold. It then constructs an equi-depth or value-based histogram on the sample and saves it in a system table for the optimizer to use. When estimating the number of distinct values, the physical layout of the data set can bias the sample, so a hybrid estimator is used to address the problem caused by the loss of the f1 value (the number of values that appear only once). Experiments under Linux AS3 and PostgreSQL 8.1 show that the approach is suitable for PostgreSQL and that it improves the efficiency of statistics gathering while maintaining estimation precision.
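The pipeline described above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the thesis's algorithm: `sample_until_stable` draws whole blocks, builds an equi-depth histogram on one half, checks it against the other half, and doubles the sample while the error exceeds the threshold. The error measure (largest deviation of a bucket's observed fraction from 1/k), the doubling schedule, and all function names are assumptions, and every block is assumed to be non-empty. For distinct values, `estimate_distinct` uses the Haas-Stokes formula n·d / (n − f1 + f1·n/N) that stock PostgreSQL applies in ANALYZE, shown here only to illustrate the role of f1; it is not the hybrid estimator the thesis proposes.

```python
import random
from bisect import bisect_right
from collections import Counter


def equi_depth_bounds(values, num_buckets):
    """Boundaries that split the sorted values into equally populated buckets."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * (n - 1)) // num_buckets] for i in range(num_buckets + 1)]


def bucket_error(bounds, validation):
    """Largest gap between a bucket's observed fraction and the ideal 1/k."""
    num_buckets = len(bounds) - 1
    counts = [0] * num_buckets
    for v in validation:
        b = bisect_right(bounds, v) - 1
        # Rows outside the histogram range fall into the first or last bucket.
        counts[min(max(b, 0), num_buckets - 1)] += 1
    expected = 1.0 / num_buckets
    return max(abs(c / len(validation) - expected) for c in counts)


def sample_until_stable(blocks, num_buckets, threshold, start_blocks=4, seed=0):
    """Grow a block-level sample until a histogram built on one half
    describes the other half within the threshold (assumed scheme)."""
    rng = random.Random(seed)
    max_k = len(blocks) // 2
    k = min(start_blocks, max_k)
    while True:
        # Block-level sampling: pick whole blocks and keep every row in them.
        picked = rng.sample(range(len(blocks)), 2 * k)
        train = [row for i in picked[:k] for row in blocks[i]]
        valid = [row for i in picked[k:] for row in blocks[i]]
        err = bucket_error(equi_depth_bounds(train, num_buckets), valid)
        if err <= threshold or k == max_k:
            rows = train + valid
            return equi_depth_bounds(rows, num_buckets), rows, err
        k = min(2 * k, max_k)  # too biased: double the sample and retry


def estimate_distinct(sample, total_rows):
    """Haas-Stokes estimate n*d / (n - f1 + f1*n/N), capped at N."""
    n = len(sample)
    freq = Counter(sample)
    d = len(freq)                                      # distinct values seen
    f1 = sum(1 for c in freq.values() if c == 1)       # values seen exactly once
    return min(total_rows, n * d / (n - f1 + f1 * n / total_rows))
```

For example, with a table stored as `blocks = [[i * 10 + j for j in range(10)] for i in range(100)]` (a sorted, fully clustered layout, which is the hard case for block sampling), `sample_until_stable(blocks, 4, 0.05)` returns the histogram boundaries, the sampled rows, and the last validation error. Clustering inside blocks also distorts f1, because duplicates of a value tend to sit in the same block, which is the bias the abstract refers to.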
Keywords/Search Tags: Query optimization, Statistics, Cost estimation, Histogram, Cross-validation, Distinct-value