Design And Implementation Of Hive Transaction Table Fragment Version File Compaction System

Posted on: 2020-09-01 | Degree: Master | Type: Thesis
Country: China | Candidate: J W Chen | Full Text: PDF
GTID: 2518305732997819 | Subject: Master of Engineering
Abstract/Summary:
With the development of the Internet, people are generating more and more data every day, and traditional relational databases have become inadequate for ELT analysis. MapReduce-based data warehouses such as Hive and Pig provide a solution to this problem. Hive has supported transactions with snapshot isolation since version 0.14. When Hive ingests streaming data or handles frequent DML operations in a high-concurrency scenario, snapshot isolation produces an excessive number of snapshot files on the underlying HDFS. These files not only take up storage space but also put pressure on the HDFS NameNode and seriously degrade the read performance of Hive transaction tables. To address the problem of too many small files encountered when using Hive transaction tables, this thesis designs and implements a compaction system that is independent of the data warehouse.

The standalone compaction system has the following main features:

1. The system performs a Major or a Minor compaction of a transaction table according to the table's statistics, which provides a basic small-file compaction function for transaction tables in the Hive data warehouse. In addition, users can manually trigger the compaction of transaction tables through the WebUI, which is easier to use than the original command line.
2. The system collects the statistical information of the transaction tables in the warehouse, either automatically or on manual request. From these statistics, the database administrator can learn the basic state of each transaction table and whether the data of a table is skewed.
3. The system provides a WebUI through which database administrators maintain compaction blacklists, compaction queues and transaction table statistics, which makes troubleshooting easier than with the original command line. In clusters where the security service is enabled, the compaction system ensures that operations on the WebUI are legitimate through CAS central login and permission checks.
4. The system uses an Active-Standby architecture to increase its availability.

Experiments show that, in a Hive cluster with the compaction service deployed, the system can effectively compact transaction tables that have many fragmented version files. Several queries from the TPC-DS benchmark were run against transaction tables with different numbers of fragmented version files. The results show that query performance on a compacted transaction table improves compared with the same table without compaction.

In terms of technology selection, the system is implemented in Java. The Thrift framework is used for service communication. The WebUI is implemented with Jetty, Servlet and the Velocity engine. The statistics metadata is stored in the MetaStore provided by Hive, the transaction table files are stored in HDFS, and the compaction operation is executed by MapReduce.
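The abstract does not say how the statistics are turned into a Major or Minor decision. The following Java sketch shows one plausible rule, under stated assumptions: the class `CompactionPlanner`, the `TableStats` fields, and both threshold values are hypothetical and are not taken from the thesis. The rule is loosely modeled on the delta-count and delta-to-base-size thresholds that Hive's built-in compactor exposes as configuration.

```java
public class CompactionPlanner {

    public enum CompactionType { NONE, MINOR, MAJOR }

    /** Hypothetical per-table (or per-partition) statistics; field names are assumptions. */
    public static final class TableStats {
        final int deltaFileCount; // number of delta (version) directories
        final long deltaBytes;    // total size of all delta files
        final long baseBytes;     // size of the current base file, 0 if none exists

        public TableStats(int deltaFileCount, long deltaBytes, long baseBytes) {
            this.deltaFileCount = deltaFileCount;
            this.deltaBytes = deltaBytes;
            this.baseBytes = baseBytes;
        }
    }

    // Illustrative thresholds (assumptions, not values from the thesis).
    static final int DELTA_COUNT_THRESHOLD = 10;
    static final double DELTA_RATIO_THRESHOLD = 0.1;

    /** Chooses a compaction type from the collected statistics. */
    public static CompactionType choose(TableStats s) {
        if (s.deltaFileCount == 0) {
            return CompactionType.NONE; // nothing to merge
        }
        boolean noBase = s.baseBytes == 0;
        // Deltas are large relative to the base: rewrite everything into a new base.
        if (!noBase && (double) s.deltaBytes / s.baseBytes > DELTA_RATIO_THRESHOLD) {
            return CompactionType.MAJOR;
        }
        // Many small deltas: merge them into one delta, or build a base if none exists.
        if (s.deltaFileCount > DELTA_COUNT_THRESHOLD) {
            return noBase ? CompactionType.MAJOR : CompactionType.MINOR;
        }
        return CompactionType.NONE;
    }

    public static void main(String[] args) {
        System.out.println(choose(new TableStats(12, 5L, 100L)));
        System.out.println(choose(new TableStats(3, 50L, 100L)));
    }
}
```

A Minor compaction merges delta files into a single delta, while a Major compaction rewrites base and deltas into a new base, so a rule of this kind reserves the more expensive rewrite for tables whose deltas have grown large relative to the base.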
Keywords/Search Tags:database, compact system, MVCC snapshot files management