
Research And Implementation Of Compression For Structured Data On Hadoop Platform

Posted on: 2016-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: B Tian
Full Text: PDF
GTID: 2298330452966406
Subject: Computer Science and Technology
Abstract/Summary:
With the development of new applications such as electronic commerce, social computing, and the Internet of Things, the scale of the relevant data is growing rapidly, and big data is changing the way people live, work, and think. Mining potentially useful information from big data to drive decision support systems more accurately and efficiently is therefore becoming increasingly important, and is gradually becoming a focus of attention in the field of data science.

As a distributed storage and computing platform with the distributed file system HDFS and the distributed computing framework MapReduce at its core, Hadoop has become the de facto standard for big data processing. Data compression is a very important way to improve query processing performance. For the sake of generality, HDFS stores structured and unstructured data in a unified way and provides support for common heavyweight compression methods, but the data must be decompressed during query processing and the overhead is quite high, so the advantages of structured data in this respect cannot be fully exploited. In column-store systems, lightweight compression methods are widely used, and query processing can be performed directly on the compressed data. However, tuple reconstruction is an important performance bottleneck in query processing on column stores, especially in a distributed environment, where the large network overhead of tuple reconstruction limits query processing performance. The row-column hybrid storage structure, which derives from the PAX storage model, combines the advantages of row stores and column stores, and can provide a good storage model for big data processing in a distributed environment.

The main purpose of this paper is to research the design and implementation of compression for structured data on the Hadoop platform.
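The idea of evaluating queries directly on lightweight-compressed column data, mentioned above, can be illustrated with a minimal sketch. This is not the thesis's implementation; it assumes run-length encoding (one common lightweight method) and a hypothetical `count_equal` aggregate that works on the runs without decompressing them.

```python
def rle_compress(column):
    """Run-length encode a column into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            # Extend the current run instead of storing the value again.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def count_equal(runs, target):
    """COUNT(*) WHERE col = target, computed directly on the compressed runs."""
    return sum(length for value, length in runs if value == target)

column = ["CN", "CN", "CN", "US", "US", "CN"]
runs = rle_compress(column)   # [("CN", 3), ("US", 2), ("CN", 1)]
print(count_equal(runs, "CN"))  # 4
```

The aggregate touches one entry per run rather than one per tuple, which is why such queries can be faster on compressed data than on the raw column.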
Firstly, after analyzing the implementation principles and characteristics of several common lightweight data compression algorithms, the paper designs a row-column hybrid data page storage structure on top of HDFS. Then, this paper proposes and implements an adaptive lightweight data compression scheme based on MapReduce: the big data is split into blocks and compressed in parallel, and the compressed data is stored in the proposed hybrid storage structure, with a newly designed data access interface, and saved in HDFS. At the same time, the paper puts forward a dynamic datanode-selection priority queue tree to balance the storage load across the nodes of the cluster. Finally, this paper analyzes and proposes a corresponding query execution scheme over the compressed data, in which queries execute directly on the compressed data so as to take full advantage of the compression. Experimental results on large-scale datasets demonstrate the effectiveness of the proposed strategy in reducing the amount of storage and improving query performance on structured data.
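The load-balancing step above keeps each node's priority updated as blocks are placed. The thesis's "priority queue tree" is not specified here; the following is a simplified sketch of the same greedy idea using a plain binary heap, with node names and block sizes invented for illustration.

```python
import heapq

def assign_blocks(node_names, block_sizes):
    """Greedily place each compressed block on the currently least-loaded datanode."""
    heap = [(0, name) for name in node_names]   # (current load, node)
    heapq.heapify(heap)
    placement = {}
    for block_id, size in enumerate(block_sizes):
        load, node = heapq.heappop(heap)        # least-loaded node comes out first
        placement[block_id] = node
        # Push the node back with its updated load so later blocks see it.
        heapq.heappush(heap, (load + size, node))
    return placement

nodes = ["dn1", "dn2", "dn3"]
blocks = [64, 64, 128, 32]                      # block sizes in MB (illustrative)
print(assign_blocks(nodes, blocks))
```

Because the least-loaded node always sits at the top of the queue, each placement is O(log n) in the number of datanodes, and storage stays balanced as compression output is written.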
Keywords/Search Tags: Hadoop, Structured Data, Row-column Hybrid Storage, Data Compression, Query on Compressed Data