
Research And Implementation Of Compression For Structured Data On Hadoop Platform

Posted on: 2016-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: B Tian
Full Text: PDF
GTID: 2298330452966406
Subject: Computer Science and Technology
Abstract/Summary:
With the development of new applications such as electronic commerce, social computing, and the Internet of Things, the scale of the relevant data is growing rapidly, and big data is changing the way people live, work, and think. Mining potentially useful information from big data to drive decision support systems more accurately and efficiently is therefore becoming increasingly important, and is gradually becoming a focus of attention in the field of data science.

As a distributed storage and computing platform with the distributed file system HDFS and the distributed computing framework MapReduce at its core, Hadoop has become the de facto standard for big data processing. Data compression is a very important way to improve query processing performance. For the sake of generality, HDFS stores structured and unstructured data in a unified way and provides support for common heavyweight compression methods, but the data must be decompressed during query processing and the overhead is quite high, so the advantages of structured data in this respect cannot be fully exploited. In column-store systems, lightweight compression methods are widely used, and query processing can be performed directly on the compressed data. However, tuple reconstruction is an important performance bottleneck in query processing on column stores, especially in a distributed environment, where the large network overhead of tuple reconstruction limits query processing performance. The row-column hybrid storage structure, which derives from the PAX storage model, combines the advantages of row stores and column stores, and can provide a good storage model for big data processing in a distributed environment.

The main purpose of this paper is to research the design and implementation of compression for structured data on the Hadoop platform.
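The idea of evaluating queries directly on lightweight-compressed column data, mentioned above, can be illustrated with a minimal sketch. This is not the thesis's implementation; it assumes run-length encoding (one common lightweight method) and a hypothetical `count_equal` aggregate that works on the runs without decompressing them.

```python
def rle_compress(column):
    """Run-length encode a column into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            # Extend the current run instead of storing the value again.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def count_equal(runs, target):
    """COUNT(*) WHERE col = target, computed directly on the compressed runs."""
    return sum(length for value, length in runs if value == target)

column = ["CN", "CN", "CN", "US", "US", "CN"]
runs = rle_compress(column)   # [("CN", 3), ("US", 2), ("CN", 1)]
print(count_equal(runs, "CN"))  # 4
```

The aggregate touches one entry per run rather than one per tuple, which is why such queries can be faster on compressed data than on the raw column.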
Firstly, after analyzing the implementation principles and characteristics of several common lightweight data compression algorithms, the paper designs a row-column hybrid data page storage structure on top of HDFS. Then, this paper proposes and implements an adaptive lightweight data compression scheme based on MapReduce: the big data is split into blocks and compressed in parallel, and the compressed data is stored in the proposed hybrid storage structure, with a newly designed data access interface, and saved in HDFS. At the same time, the paper puts forward a dynamic datanode-selection priority queue tree to balance the storage load across the nodes of the cluster. Finally, this paper analyzes and proposes a corresponding query execution scheme over the compressed data, in which queries execute directly on the compressed data so as to take full advantage of the compression. Experimental results on large-scale datasets demonstrate the effectiveness of the proposed strategy in reducing the amount of storage and improving query performance on structured data.
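The load-balancing step above keeps each node's priority updated as blocks are placed. The thesis's "priority queue tree" is not specified here; the following is a simplified sketch of the same greedy idea using a plain binary heap, with node names and block sizes invented for illustration.

```python
import heapq

def assign_blocks(node_names, block_sizes):
    """Greedily place each compressed block on the currently least-loaded datanode."""
    heap = [(0, name) for name in node_names]   # (current load, node)
    heapq.heapify(heap)
    placement = {}
    for block_id, size in enumerate(block_sizes):
        load, node = heapq.heappop(heap)        # least-loaded node comes out first
        placement[block_id] = node
        # Push the node back with its updated load so later blocks see it.
        heapq.heappush(heap, (load + size, node))
    return placement

nodes = ["dn1", "dn2", "dn3"]
blocks = [64, 64, 128, 32]                      # block sizes in MB (illustrative)
print(assign_blocks(nodes, blocks))
```

Because the least-loaded node always sits at the top of the queue, each placement is O(log n) in the number of datanodes, and storage stays balanced as compression output is written.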
Keywords/Search Tags: Hadoop, Structured Data, Row-column Hybrid Storage, Data Compression, Query on Compressed Data