
Performance Optimization Of SQL Computing For Column-oriented Database

Posted on: 2010-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Tan
GTID: 2178360272497090
Subject: Software engineering
Abstract/Summary:
Developing databases with independently owned intellectual property is an important step for our information industry. Such databases are urgently needed not only in traditional fields such as national defense and e-government, but also in application areas that require efficient, complex statistical analysis, such as statistical audit management, network monitoring analysis, and historical analysis in finance and telecommunications. GBase 8a is a domestically developed analytic database with independent intellectual property, developed independently by Nanda General. GBase 8a is a column-oriented database, but it also draws heavily on data management techniques from in-memory databases. Focusing on the characteristics of column stores, this paper describes in detail the performance optimizations built on column-oriented storage and their implementation.

As data volumes grow, so does the demand for data analysis. Looking at a single record has little significance; statistical analysis over all the data is what matters, and it places higher demands on database performance. Domestic analytic databases therefore need to keep improving system performance to meet users' needs and to stay competitive in the market.

The column store is a new type of database storage that differs fundamentally from the row store. In a row store, each table corresponds to a chain of database pages, and each page contains one or more rows (records). The fields of each record are stored in order within the database pages. Row-oriented databases suffer from problems such as redundant disk I/O, disk fragmentation, and low storage utilization. A column store instead stores data column by column. Compared with a row store, it subdivides the storage area of each page.
A data page no longer stores all the columns of one or more records, but a single column of one or more records. Column stores thus effectively address the redundant disk I/O, disk fragmentation, and low storage utilization found in row-oriented databases.

CPU clock frequencies rose for decades at a pace often likened to Moore's Law (which strictly describes transistor counts doubling about every two years). In addition, CPUs have adopted a series of techniques such as pipelining, caches, instruction prefetching, branch prediction, and out-of-order execution, which further increase processing speed. Superscalar CPUs now run several pipelines at once to achieve internal parallelism, pushing processing speed to new heights. Making sound use of these features is therefore very important. Compiler optimizations can exploit them to make code execute more efficiently through array merging, loop interchange, loop fusion, and loop blocking.

The performance optimization in this paper differs from common database performance optimization. It exploits the characteristics of column stores and modern CPUs, and it targets high-volume data and complex function computation. This paper focuses on two techniques for column-oriented databases, column-per-one-time and vector-per-one-time, developed incrementally, with the second building on the first.

Column-per-one-time improves performance over the traditional Volcano iterator model. The Volcano iterator model computes tuple-per-one-time, which results in a large number of useless operations. Dependences between successive data items cause many pipeline bubbles. Data prefetching cannot be used effectively because of row-oriented storage. The instructions also depend on one another, so multiple pipelines cannot be fully used. Column-per-one-time processes all the data of one column at a time, so useless operations occur only at the edges of each column and are negligible for large data volumes. Pipeline bubbles disappear, data prefetching works well, and multiple pipelines can run at the same time.

The authors propose vector-per-one-time on the basis of column-per-one-time to speed up SQL computation further. Although column-per-one-time greatly improves the performance of the database system, it must write the intermediate results for an entire column to main memory if they do not fit in the cache, which can cause capacity misses (one kind of cache miss). Vector-per-one-time solves this problem effectively. Each operation works on neither a single record nor a whole column but on a vector: a block of one column's data whose size is chosen according to the size of the cache. The intermediate results of a vector block can therefore stay in the cache and need not be written to main memory. The CPU finds them in the cache when the next operation needs them, which avoids much of the traffic between main memory and cache. Vector-per-one-time uses a selection vector, which is simply an array that records the positions of the source data that pass a filter. The selection vector and the source data are then passed together to the functions. Vector-per-one-time also introduces two new techniques, branch elimination and predefined compound computation. Both are still experimental and are currently implemented statically; implementing them dynamically remains a goal.
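The vector block and selection vector described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GBase 8a implementation: the function names, the `VECTOR_SIZE` constant, and the price/quantity example are all hypothetical, and the branch-free counting in `select_less_than` is only one common way to avoid a conditional jump, since the abstract does not say how its branch elimination works. Python shows only the control flow here; the cache benefit would come from an equivalent loop in a compiled language.

```python
# Hypothetical block length; a real engine would size it so that one block
# of every column involved, plus intermediates, fits in the CPU cache.
VECTOR_SIZE = 4


def select_less_than(block, bound):
    """Build a selection vector: positions in `block` whose value is < bound.

    The position is always written and the counter advances by the boolean
    result (0 or 1), so the loop body contains no if statement. This is an
    assumed technique, not necessarily the one used in the thesis.
    """
    selection = [0] * len(block)
    count = 0
    for position, value in enumerate(block):
        selection[count] = position
        count += value < bound  # True adds 1, False adds 0
    return selection[:count]


def multiply_selected(left, right, selection, out):
    """Write left[i] * right[i] into out[i] for the selected positions only."""
    for i in selection:
        out[i] = left[i] * right[i]


def sum_selected(values, selection):
    """Sum the values at the selected positions."""
    return sum(values[i] for i in selection)


def filtered_revenue(price, quantity, max_quantity, vector_size=VECTOR_SIZE):
    """Sum price * quantity over rows with quantity < max_quantity.

    The columns are processed one vector block at a time. The scratch buffer
    holding the intermediate products is one block long and is reused, so the
    intermediate result never grows to the size of a whole column.
    """
    total = 0
    scratch = [0] * vector_size
    for start in range(0, len(price), vector_size):
        price_block = price[start:start + vector_size]
        quantity_block = quantity[start:start + vector_size]
        selection = select_less_than(quantity_block, max_quantity)
        multiply_selected(price_block, quantity_block, selection, scratch)
        total += sum_selected(scratch, selection)
    return total


# Rows 0, 2 and 4 pass the filter: 10*1 + 30*2 + 50*3
print(filtered_revenue([10, 20, 30, 40, 50], [1, 5, 2, 7, 3], 4))  # 220
```

Passing the selection vector alongside the unmodified source block, rather than copying the surviving values into a new array, is what lets each later function skip filtered-out positions without moving data.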
Finally, we run the TPC-H benchmark. We concentrate on the first TPC-H query, because it involves only one table and contains no join operation, so it reflects our optimization work most accurately. We compare the results for MySQL 5.0, column-per-one-time, and vector-per-one-time at different scale factors and analyze the performance gains in detail. We also give the relationship between the performance gain and the data size. After a large number of experiments, the results show that optimization targeting the characteristics of column stores and modern CPUs has a clear effect. CPUs continue to develop rapidly, however, and new features and technologies keep emerging, so we will continue to optimize our database for modern CPUs. Due to time constraints, the optimizations in this paper assume that all data fits in main memory. The case where the data exceeds main memory is the focus of our next work, which we will carry out together with the colleagues responsible for the storage management module.
Keywords/Search Tags:Column-Stores, Column-per-one-time, Vector-per-one-time, GBase 8a, Cache, Performance Optimization