Font Size: a A A

Research On Some Key Technologies For Column-stores

Posted on:2014-01-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:X W DingFull Text:PDF
GTID:1228330395481280Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
Analytical DBMS with high performance is greatly required by high level managers in many enterprises for big data analysis and decision making assistance. Column-store is still on research due to its high efficiency in big data query, and many research progenies have been made. However, some key technologies in column-stores such as storage optimization, query optimization and query execution are still needed to be further studied.Data in relational table is stored by columns in column-stores. Processing a query by reading columns that are only needed by the query is able to avoid reading unnecessary columns in the table from disk in column stores. The data stored by columns has better compressibility than that stored by row, and queries can be evaluated directly on compressed data. These two characteristics make column store better input efficiency than traditional row store, and improve query processing performance. On the other hand, when processing queries on data stored by column, attribute values stored separately in different columns are extracted to reconstruct into tuples. Tuple reconstruction is an important performance bottleneck in column stores.Relying on the subject of major science and technology project "Development of a DBMS prototype for data warehouse"(Grant No.2010ZX01042-001-003-04) supported by Ministry of Industry and Informatilization of China, and on the project "An optimized storage model with row and column mixture in data warehouse"(Grant No.61070031) supported by National Natural Science Foundation of China, this thesis studied some key technologies in column-stores, aiming at improving query performance. The major contributions of this thesis are as follows:(1)We proposed a combined multi-column store model after studied the effects of data storage layout on performance of tuple reconstruction. System analyses automatically users’ query history, the column groups that were output together frequently from a logical table are generated, then every one of these column groups are materialized. For a column group needed to be materialized, a logical projection is firstly formed and a horizontal partiontion to the projection is generated. With a block, data are organized by column, then compressed and stored. This approach takes fully advantage of column store, and reduces substantially cost of tuple-reconstruction, and provides optimal storage for later queries.(2) Traditional B+tree indices are sparse, and searching paths within them are long. It is not fit for analytic workload due to its low insertion and search efficiency. In the thesis, we proposed the structure of reduced B+tree (RB+tree for short) which suites for column stores. A RB+tree is almost a full and balanced binary tree, in which each node is able to accommodate more indexed entries, therefore, search paths are short. So, RB+tree is able to improve the performance of query processing.A bottom-to-up creation and maintaining approach for rowID indices and column value indices in RB+tree was presented in this thesis.(3) We studied data compression technologies in database, including lightweight compression, the selection strategy of compression granularity and compression scheme. Based on intensive study of bitmap compressions, we proposed a rich-extend partition bitmap index approach and a self-adaptive partition word-aligned bit vector compression approach (APWAH). Each rich-extend partition bitmap index contained the statistics information which could facilitate directive aggregate operation on partition bitmaps. APWAH adaptively selectes the most suitable0-fill and1-fill segment length according to0-1distribution in a bit vector, thus it could hava better compression effects and improve query performance. Sector-level compression was studied as well. Two merits of sector-level compression were its high compressing ratio and its simplity of compression management. This thesis also presented that compressing subsystem in column stores should select sector size adaptive for data distribution. One sector is composited with some blocks, the number of blocks in different sectors might be different. Thus compressing subsystem could flexibly partion a column into some sectors according to similarity of adjoining block, and it is not nesesary that all sectors are of the same size. This ensures that the similarity of data in a sector was strong, and the similarity of data between adjoining sectors was weak, to select most appropriate compressing approach for each sector. Moreover, a data distribution feature model was built, and a decision scheme for selecting compressing approach was created with proposed model. (4) We proposed a3-level buffer management solution suitable for column-stores after studied buffer managing technologies. On the global level, a free buffer chain and in-use buffer chain are constructed, resectively. The blocks in the in-use chain are replaced by general adaptive replacement policy. The buffer replacement decision is made according to not only the running queries but also some afterwards queries. On the query level, a main chain is maintained for every query’ in-use buffer. When a concurrent operation phase occurs, a branch buffer chain is generated from the main buffer chain to manage buffer used by the concurrent phase, respectively. On the operation phase level, we designed a flexible and adaptive buffer allocation approach-MG-x-y-z for every operation phase and replacement policy suitable for access pattern of the operation phase. The proposed3-level buffer strategy took fully into account not only the characteristics of analytic workload, pattern of accessing data and available buffer, but also other factors like pre-fetching data etc.(5) Finaly, we proposed a materializing technology based paths with value (PVM) after studied the materialization strategies in column stores. PVM added paths with value to query execute tree, and utilized passing-block to hold intermediate results. This technique avoided reading original data blocks repeatingly. For the intermediate results of bit vectors in PVM, executing subsystem calle the proposed APWAH compression approach to compress it, which could reduce or avoid large I/O cost due to large bit vector in intermediate results.The mentioned technologies above were the key components of our column store DBMS prototype system. These researched results improve the overall performance of the system.
Keywords/Search Tags:Column store, DWMS, Multi-column materialization, RB+TreeIndex, Sector-level Compression, Rich Extended Bitmap, APWAH CompressionApproach, Hierarchical Buffer Management, materialization based path with value(PVM)
PDF Full Text Request
Related items