The Design And Implementation Of A Data Warehouse Engine Based On Hadoop

Posted on:2016-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:D Y Gao

GTID:2308330467997067

Subject:Software engineering

Abstract/Summary:

Data warehouse systems are among the indispensable decision support systems in modern enterprises. Traditionally they are powered by relational database management systems and parallel database systems. In recent years, with the growing need to manage large volumes of data, traditional data warehouse systems have run into serious scalability obstacles. Meanwhile, the big data processing technology Hadoop is becoming the foundation of data management in many companies owing to its high availability, high scalability, and very low cost. Originally, Hadoop comprised a distributed file system called HDFS and a parallel computing framework called MapReduce. Soon after, several data warehouse engines were developed to provide an SQL interface for Hadoop, among which Hive, created at Facebook, is the most widely adopted. However, because it uses MapReduce as the execution engine for relational queries, Hive inherits many of MapReduce's performance problems and is therefore outperformed by parallel database systems on TB-level datasets.
This paper presents the design and implementation of a high-performance data warehouse engine on top of Hadoop. The system employs a hybrid architecture that uses HDFS as the storage layer and a parallel SQL execution engine as the computing layer. Using HDFS to store user data frees the engine from tasks such as managing data replicas and tolerating disk failures, while using a parallel SQL execution engine instead of MapReduce for query processing gives the system performance comparable to that of parallel database systems. Unlike many existing Hadoop warehouse systems, the system is also fully transactional. The paper covers several aspects of the system, including architectural design, parallel query processing, transaction support, and the columnar storage format.
The system is implemented by modifying the PostgreSQL kernel. The author's main work is on the query executor module, including extending the iterator execution model and implementing Parquet columnar storage inside PostgreSQL. The author also carried out the performance testing of the system. A comprehensive performance evaluation of the system using the TPC-H benchmark is presented at the end. The test results show that the system discussed in this paper is over 10 times faster than Hive on simple select queries and over 40 times faster on complex join queries.
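For readers unfamiliar with the iterator execution model mentioned in the abstract, the following is a minimal sketch of the general idea, written in Python for brevity. It is not the thesis's code (the actual system modifies PostgreSQL's executor), and every name here (Operator, Scan, Filter, Project, run) is hypothetical and chosen only for illustration. Each operator exposes open, next, and close, and pulls one row at a time from its child.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


class Operator:
    """Base class: every operator exposes open(), next(), and close()."""

    def open(self) -> None:
        pass

    def next(self) -> Optional[Row]:
        """Return the next row, or None when the input is exhausted."""
        raise NotImplementedError

    def close(self) -> None:
        pass


class Scan(Operator):
    """Leaf operator: yields rows from an in-memory list (a stand-in for a table)."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.pos = 0

    def open(self) -> None:
        self.pos = 0  # restart the scan each time the plan is opened

    def next(self) -> Optional[Row]:
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row


class Filter(Operator):
    """Pulls rows from its child and passes on only those matching the predicate."""

    def __init__(self, child: Operator, predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Row]:
        while True:
            row = self.child.next()
            if row is None:
                return None
            if self.predicate(row):
                return row

    def close(self) -> None:
        self.child.close()


class Project(Operator):
    """Keeps only the requested columns of each row pulled from its child."""

    def __init__(self, child: Operator, columns: List[str]) -> None:
        self.child = child
        self.columns = columns

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Row]:
        row = self.child.next()
        if row is None:
            return None
        return {c: row[c] for c in self.columns}

    def close(self) -> None:
        self.child.close()


def run(plan: Operator) -> List[Row]:
    """Drive the plan from the root, pulling rows until the input is exhausted."""
    plan.open()
    out: List[Row] = []
    try:
        row = plan.next()
        while row is not None:
            out.append(row)
            row = plan.next()
    finally:
        plan.close()
    return out


# Roughly: SELECT id FROM items WHERE price > 10
items = [{"id": 1, "price": 5.0}, {"id": 2, "price": 20.0}, {"id": 3, "price": 30.0}]
plan = Project(Filter(Scan(items), lambda r: r["price"] > 10.0), ["id"])
result = run(plan)
```

Because every operator has the same interface, operators compose freely into a plan tree, and only one row is in flight per operator at a time. A parallel engine of the kind the abstract describes would presumably extend this model with operators that exchange rows between nodes, but that part is not sketched here.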

Keywords/Search Tags:

Data Warehouse Engine, Query Processing, Parallel Computing

Related items

1	Parallel Query Processing In Data Warehouse Management System
2	Research On Key Technologies Of Distributed Rank-aware Query Processing
3	Parallel Query And Optimization In Column-stores On CPU-GPU Architecture
4	Design And Implementation Of Data Warehouse And Complex Ad-hoc Query For Commercial Banks
5	Parallel Query Processing Techniques In Parallel Database System PBASE/2
6	Parallel Query Processing On Trajectory Data
7	Parallel Query Processing System On Large-scale RDF Data
8	Research On Fault-Tolerant Parallel Skyline Query Technology In Cloud Computing Environment
9	Research On Query Optimization Of Data Warehouse Based On Improved Ant Colony Algorithm
10	Object-oriented Databases Parallel Query Processing And Transaction Management