Font Size: a A A

Research On Key Technologies Of Column-Oriented Database For Big Data

Posted on:2018-06-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:H LiuFull Text:PDF
GTID:1368330542992949Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the rapid development of emerging technologies such as mobile Internet,Internet of things and cloud computing,the data scale promoted by the blogs,e-commerce,and social networks is growing at an astonishing rate,which greatly accelerates the arrival of"big data era".To manage such a large amount of data and obtain the valuable information,the column-oriented database,which shows great advantages for big data,has attracted an extensive attention in the academia and industry.Focusing on the storage model and key technologies of column-oriented database,scholars carry out a wide range of research.At the same time,based on the theoretical research,a number of open source column-oriented database prototype systems are presented,such as C-Store,MonetDB,Sybase IQ and HBase.The results of TPC-H benchmark further show that the performance of column-oriented database is superior to that of traditional row-oriented database.Therefore,for big data,the research on key technologies of column-oriented database is of great importance.By adopting column as the basic unit of data storage and data query,column-oriented database has advantages on data compression,data indexing and data querying.In recent years,a series of results have been achieved around the above key technologies and their performance have been verified in column-oriented database prototype systems.However,with the growing column scale and column domain,these techniques exposes more and more defects,which need to be further improved and optimized.Therefore,from the perspective of theoretical basis and practical application,a series of algorithms and solutions are proposed in this dissertation.The main contents of this dissertation are as follows:(1)To enhance the efficiency of traditional dictionary compression algorithm under big data compression scenarios,a dynamic incremental dictionary compression algorithm is proposed.This algorithm adopts the dynamic incremental compression to further compress the strings stored in the dictionary and uses the CSB~+-Tree to encode and decode the dictionary.The dynamic incremental compression solves the dramatically increasing dictionary space as the compressed data domain becomes larger.Also,the CSB~+-Tree realizes the dictionary encoding and decoding.By the dynamic incremental compression and CSB~+-Tree,the proposed algorithm effectively increases the compression efficiency for big data.Moreover,the experimental results verify the efficiency and practicality of the dynamic incremental dictionary compression algorithm.(2)Focusing on the low accuracy and slow execution speed in the self-selection strategy of compression algorithms,a parallel self-selection strategy is proposed.This algorithm adopts the weighted naive Bayesian classification model to classify the compression algorithms and further uses the multi-core MapReduce model to realize the parallel selection and parallel compression.The weighted naive Bayesian classification model increases the accuracy of compression algorithm selection.Also,the multi-core Map Reduce model accelerates the execution speed of algorithm selection and compression.Moreover,the experimental results show that the parallel self-selection algorithm significantly improve the accuracy and execution speed for compression algorithm selection.(3)To address the poor convergence performance problem faced by the traditional database cracking under the sequential query pattern,a multi-pivot database cracking algorithm is proposed.Besides the boundary value of the query,the random value generated by the range of the column and the query boundary also listed as the pivot to crack the index,which effectively solves the dependency of traditional database cracking on the query pattern.Also,the experimental results show that the multi-pivot database cracking algorithm can achieve a good convergence performance under not only the random query pattern but also the sequential query pattern.(4)To solve the problem of the excessive long execution time of traditional hybrid indexing algorithm under the sequential query pattern,a radix-partioning based hybrid indexing algorithm is proposed.By dividing the index column into a number of data blocks which are inter-ordered and intra-disordered,this algorithm reduces the amount of data touched by each query,thus siginificantly decreasing the execution time.Also,the experimental results show that the proposed algorithm requires a short execution time under not only the random query pattern but also the sequential query pattern.(5)Focusing on the low query execution efficiency problem of column-oriented database on big data,a parallel query operation algorithm based on data access strategies and a parallel query execution algorithm based on dense tree are proposed.Adopting two new data access strategies,the parallel query operation algorithm realizes the the parallel execution of common data operators in column-oriented database.Through the parallelism of query plan and the parallelism of the query operator,the parallel query execution algorithm realizes the parallel execution of the query.Also,the experimental results show that the proposed parallel algorithms effectively improve the query execution efficiency of column-oriented database on big data.
Keywords/Search Tags:Column-Oriented Database, MapReduce Model, Data Compression, Adaptive Indexing, Query Execution
PDF Full Text Request
Related items