
Research And Implementation Of Data Compression Based On Column-Oriented Database System

Posted on: 2011-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: P Huang
GTID: 2178360305955062
Subject: Computer software and theory
Abstract/Summary:
Relational database systems have been developed over several decades. During this period the scale of data has grown enormously and queries have become more sophisticated, while the data-processing techniques inside the database have remained largely unchanged. The column-oriented database is designed specifically to handle large-scale data. It is mainly used for online analytical processing and other read-optimized applications such as data mining and decision support systems. A column-oriented database retains the relational data model; its advantage lies in high performance when processing large volumes of data.

In fact, column-oriented databases are simply another implementation of the relational model. In contrast to a traditional row-store database, which processes data row by row, a column-oriented database stores data column by column. This storage model offers fine access granularity: the database does not have to read irrelevant columns, which saves I/O bandwidth. In addition, columnar data is very easy to compress efficiently. This thesis investigates these characteristics in depth.

As the volume of data in database applications grows, researchers and developers look for effective compression methods to cope with the data explosion, because compression not only saves disk space but also improves performance by reducing the amount of data that must be read. Data stored in a column-oriented database is particularly easy to compress, since adjacent values belong to the same column and therefore share the same type (integer, string, etc.). At the same time, lightweight compression methods can be used to minimize decompression overhead; in many cases values can be obtained directly from the compressed data without decompressing it, which further improves database performance.

Lightweight compression exploits simple regularities in the data. The compression algorithms best suited to a database are those that achieve acceptable compression ratios and can be decompressed quickly. A lightweight compression algorithm is characterized by fast decompression, or may require no decompression at all, because in some cases the compressed data can be accessed directly. This thesis studies the main lightweight compression algorithms and proposes a new algorithm, frequent-bytes compression, which is very effective in certain cases.

Lightweight compression algorithms are easy to apply in column-oriented databases: compression is simple, and accessing the data often does not require decompression. An important problem for a column-oriented database is therefore how to access compressed data efficiently. Taking into account the data-processing procedure of column-oriented databases and the characteristics of lightweight compression algorithms, this thesis studies a number of SQL operators and analyzes the performance of accessing compressed data of different types.
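As a hedged illustration of the general principle only (operating on compressed data directly, not any specific operator implementation from this thesis), the sketch below evaluates an equality predicate over a run-length-encoded column and returns matching row IDs without expanding the runs back into individual rows. All names are illustrative.

```python
# Minimal sketch: evaluate "column = value" directly on a run-length-encoded
# column, returning row IDs without decompressing the runs into individual rows.
# Generic illustration only; not an operator implementation from the thesis.

def rle_encode(values):
    """Encode a repetitive column as (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def select_eq(runs, target):
    """Return row IDs where the column equals `target`, scanning runs only."""
    row_ids, row = [], 0
    for value, length in runs:
        if value == target:                           # one comparison per run,
            row_ids.extend(range(row, row + length))  # not one per row
        row += length
    return row_ids

if __name__ == "__main__":
    column = ["DE", "DE", "DE", "US", "US", "CN", "CN", "CN", "CN"]
    runs = rle_encode(column)          # [('DE', 3), ('US', 2), ('CN', 4)]
    print(select_eq(runs, "CN"))       # -> [5, 6, 7, 8]
```

The point of the sketch is that the predicate is evaluated once per run rather than once per row, which is why lightweight schemes can improve query performance rather than merely save space.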
The results show that lightweight compression algorithms compress data effectively without sacrificing the performance of SQL operations. To further improve the performance of queries over compressed data, three problems are addressed in this thesis: compressed-data searching (the compressed matching problem), the conversion of offsets to row IDs, and compressed-data locating (compressed accessing).

Compressed searching, also known as the compressed matching problem (CMP), is the task of finding all occurrences of given patterns in compressed text without decompression. In a database, the SELECT operator is essentially an application of CMP. This thesis studies methods for searching compressed data and, as a case study, examines automaton-based pattern matching in Huffman-compressed text together with the use of a super-alphabet. The most suitable super-word lengths for different data sizes are derived by careful calculation, and experiments show that these values provide strong practical guidance.

The occurrences of a pattern found by compressed searching are usually offsets within a compressed data block, but what do these offsets mean to the database? Query processing ultimately needs the row IDs of the result set, so the offsets returned by the search must be converted to row IDs. This conversion is required whenever compressed searching and compressed accessing are combined.

Compressed data accessing is the problem of retrieving the n-th element from compressed data without decompression. Int-packing compression is studied in depth, and the int-packing algorithm is improved so that values can be retrieved from the compressed data more easily. The Pack2, Pack3 and Pack4 compression algorithms are proposed, together with their corresponding accessing algorithms Pack2Locate, Pack3Locate and Pack4Locate. Experiments show that Pack3, and especially Pack4, performs very well for compressed accessing.
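To illustrate the general idea behind int-packing and compressed accessing (not the Pack2/Pack3/Pack4 algorithms themselves, which are specific to this thesis), the following sketch packs small integers at a fixed bit width and retrieves the n-th value by address arithmetic, without decompressing the rest of the block. All names are illustrative assumptions.

```python
# Minimal sketch of fixed-width int-packing with direct access to the n-th value.
# Generic illustration only; not the thesis's Pack2/Pack3/Pack4 algorithms.

def pack(values, bit_width):
    """Pack non-negative integers into a byte buffer, bit_width bits each."""
    buf = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bit_width)
        buf |= v << (i * bit_width)               # place value i at its bit offset
    n_bytes = (len(values) * bit_width + 7) // 8
    return buf.to_bytes(n_bytes, "little"), bit_width

def locate(packed, bit_width, n):
    """Return the n-th packed value without unpacking the rest of the block."""
    bit_offset = n * bit_width
    byte_start = bit_offset // 8
    byte_end = (bit_offset + bit_width + 7) // 8  # bytes touched by this value
    chunk = int.from_bytes(packed[byte_start:byte_end], "little")
    return (chunk >> (bit_offset % 8)) & ((1 << bit_width) - 1)

if __name__ == "__main__":
    data = [3, 7, 0, 5, 6, 1, 2, 4]
    packed, w = pack(data, bit_width=3)           # 3 bits per value
    assert [locate(packed, w, i) for i in range(len(data))] == data
    print(locate(packed, w, 4))                   # -> 6
```

Because every value occupies the same number of bits, the position of the n-th element is computed directly, which is the property that makes compressed accessing cheap for bit-packed columns.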
Keywords/Search Tags:Column-Oriented DBMS, Data Compression, Compressed Data Accessing, Compressed Matching Problem