
Design and Implementation of a Storage Structure Extension for the Big Data Warehouse Hive

Posted on: 2016-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: M Wang
Full Text: PDF
GTID: 2348330503494319
Subject: Software engineering
Abstract/Summary:
Hadoop has become a popular open-source platform in the field of big data and has formed an almost complete "ecosystem". Apache Hive, created by Facebook, is an open-source data warehouse that supports SQL queries on top of Hadoop: it translates each SQL query into MapReduce jobs, submits them to Hadoop, and returns the query results. Currently, two major problems limit the further adoption of Hive: long query response times and high storage space requirements. Both academia and industry have been working on these two problems; most of the work focuses on optimization of the SQL parser, parallel data processing, optimization of storage formats, I/O utilization, dynamic allocation of Reduce computing resources, and HDFS RAID hierarchical storage.

To address these two problems, this thesis proposes an improved storage format called FOSF, short for Flexible Optimized Segment File. FOSF is based on an analysis of HDFS (Hadoop Distributed File System), the MapReduce framework, and two-dimensional table storage technology, and was developed in the context of the data center project of a well-known communication company (referred to as H Company). FOSF implements Hive's StorageHandler interface. In TPC-H experiments, FOSF saves about 20% of query time, about 50% of storage space, and about 10% of data loading time compared to the current storage formats in Hive.

The main work of this thesis is as follows.

First, the current storage format RecordColumnFile always loads the whole filtering column into memory, regardless of how many rows satisfy the filter conditions. To solve this problem, this thesis proposes a columnar index algorithm based on column metadata. The algorithm indexes the maximum and minimum values of the column data; for queries with filters, the index helps skip the records that cannot satisfy the filter conditions instead of loading the whole column into memory (a minimal sketch of this idea appears after the abstract). Experiments show that only about 1/4 of the data is loaded into memory compared to RecordColumnFile.

Second, the current compression algorithms in Hive, such as LZO, do not consider the actual data distribution and therefore cannot achieve a good compression ratio. To save storage space and improve data loading efficiency, this thesis proposes three data compression algorithms together with an adaptive method for choosing among them based on the data distribution of each column. The three algorithms suit arithmetic progression sequences, sequences with many duplicate values, and sequences with small adjacent increments, respectively; the adaptive decision algorithm selects the most suitable compression algorithm for each column (see the selection sketch below). Experiments show that the proposed approach improves the compression ratio by about 50% and reduces both compression time and decompression time by about 10%.

Third, FOSF implements the StorageHandler interface in Hive (a skeleton is sketched below), providing the metadata-based columnar index, self-adaptive compression based on data distribution, and hybrid storage.
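To make the columnar index idea concrete, the following is a minimal Java sketch of min/max segment skipping: each column is stored in segments, each segment records the minimum and maximum value it contains, and a filter only reads segments whose range can match. All class and method names (SegmentMeta, segmentsForLessThan) are illustrative assumptions, not the thesis's actual FOSF code.

```java
import java.util.ArrayList;
import java.util.List;

public class MinMaxIndexSketch {

    /** Per-segment metadata: the value range of one column chunk. */
    static final class SegmentMeta {
        final long min, max;
        final long offset;   // where the segment's data starts in the file
        SegmentMeta(long min, long max, long offset) {
            this.min = min; this.max = max; this.offset = offset;
        }
    }

    /** Returns offsets of segments that may contain rows with value < bound. */
    static List<Long> segmentsForLessThan(List<SegmentMeta> index, long bound) {
        List<Long> toRead = new ArrayList<>();
        for (SegmentMeta s : index) {
            if (s.min < bound) {       // segment can hold a matching row
                toRead.add(s.offset);
            }                          // else: skip the segment entirely
        }
        return toRead;
    }

    public static void main(String[] args) {
        List<SegmentMeta> index = List.of(
            new SegmentMeta(0, 99, 0L),
            new SegmentMeta(100, 199, 4096L),   // skipped for bound = 100
            new SegmentMeta(50, 150, 8192L));
        System.out.println(segmentsForLessThan(index, 100)); // [0, 8192]
    }
}
```

Only segments whose minimum falls below the bound are read; the middle segment never touches memory, which is the effect the thesis measures as loading only about 1/4 of the data.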
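The adaptive compression decision can be sketched similarly. The sketch below assumes the three algorithms correspond to fixed-delta encoding for arithmetic progressions, run-length encoding for columns dominated by duplicates, and small-delta encoding for sequences with small adjacent increments; the thresholds, names, and decision order are illustrative guesses, not the thesis's actual rules.

```java
public class AdaptiveCodecSketch {

    enum Codec { FIXED_DELTA, RUN_LENGTH, SMALL_DELTA, NONE }

    /** Scans one column and picks the codec that fits its distribution. */
    static Codec choose(long[] column) {
        if (column.length < 2) return Codec.NONE;

        boolean arithmetic = true;
        long commonDiff = column[1] - column[0];
        long runs = 1, maxDelta = 0;

        for (int i = 1; i < column.length; i++) {
            long delta = column[i] - column[i - 1];
            if (delta != commonDiff) arithmetic = false;
            if (delta != 0) runs++;                  // a new run of values starts
            maxDelta = Math.max(maxDelta, Math.abs(delta));
        }
        if (arithmetic) return Codec.FIXED_DELTA;    // store first value + diff
        if (runs * 4 < column.length) return Codec.RUN_LENGTH; // few, long runs
        if (maxDelta < 128) return Codec.SMALL_DELTA; // deltas fit in one byte
        return Codec.NONE;
    }

    public static void main(String[] args) {
        System.out.println(choose(new long[]{10, 20, 30, 40}));      // FIXED_DELTA
        System.out.println(choose(new long[]{7, 7, 7, 7, 7, 7, 7, 7,
                                             9, 9, 9, 9}));          // RUN_LENGTH
        System.out.println(choose(new long[]{100, 103, 101, 106}));  // SMALL_DELTA
    }
}
```

Running the decision once per column, as the thesis describes, keeps the selection cost a single pass over the data while letting each column get the codec its distribution favors.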
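Finally, a hedged skeleton of how a custom format typically plugs into Hive through the StorageHandler mechanism. Hive's org.apache.hadoop.hive.ql.metadata.HiveStorageHandler interface and DefaultStorageHandler base class are real, though exact signatures vary across Hive versions; the FOSF-specific classes referenced here (FosfInputFormat, FosfOutputFormat, FosfSerDe) are hypothetical stand-ins for the thesis's implementation and are not defined in this sketch.

```java
import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.OutputFormat;

public class FosfStorageHandler extends DefaultStorageHandler {

    @Override
    public Class<? extends InputFormat> getInputFormatClass() {
        // Reads FOSF segments, consulting the min/max column index to
        // skip segments that cannot satisfy the query's filters.
        return FosfInputFormat.class;   // hypothetical class
    }

    @Override
    public Class<? extends OutputFormat> getOutputFormatClass() {
        // Writes columns segment by segment, running the adaptive
        // compression decision for each column before flushing.
        return FosfOutputFormat.class;  // hypothetical class
    }

    @Override
    public Class<? extends SerDe> getSerDeClass() {
        // Maps Hive rows to and from the FOSF columnar layout.
        return FosfSerDe.class;         // hypothetical class
    }
}
```

A table would then select the handler through Hive's STORED BY clause, e.g. CREATE TABLE t (...) STORED BY 'FosfStorageHandler', after which queries against the table transparently use the custom format.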
Keywords/Search Tags:Hadoop, Hive, SQL, Storage, Adaptive Compression, Column Index