The Design And Implementation Of A Data Warehouse Engine Based On Hadoop

Posted on: 2016-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Gao
GTID: 2308330467997067
Subject: Software engineering
Abstract/Summary:
A data warehouse system is one of the indispensable decision support systems in a modern enterprise. Traditionally, it is powered by relational database management systems and parallel database systems. In recent years, with the growing need to manage large volumes of data, traditional data warehouse systems have run into a major scalability obstacle. Meanwhile, the big data processing technology Hadoop is becoming the foundation of data management in many companies due to its high availability, high scalability, and very low cost. Originally, Hadoop comprised a distributed file system called HDFS and a parallel computing framework called MapReduce. Soon after, several data warehouse engines were developed to provide a SQL interface for Hadoop, among which Hive, created at Facebook, is the most widely adopted. However, because it uses MapReduce as the execution engine for relational queries, Hive inherits many of MapReduce's performance problems and is therefore outperformed by parallel database systems on TB-level datasets.
This paper presents the design and implementation of a high-performance data warehouse engine on top of Hadoop. The system employs a hybrid architecture that uses HDFS as the storage layer and a parallel SQL execution engine as the computing layer. Using HDFS to store user data frees the engine from tasks such as managing data replicas and tolerating disk failures, while using a parallel SQL execution engine instead of MapReduce for query processing gives the system performance comparable to that of parallel database systems. Unlike many existing Hadoop warehouse systems, the system is also fully transactional. This paper covers several aspects of the system, including architectural design, parallel query processing, transaction support, and the columnar storage format.
The system is implemented by modifying the PostgreSQL kernel. The author's main work is on the query executor module, including extending the iterator execution model and implementing Parquet columnar storage inside PostgreSQL. The author also carried out the performance testing of the system. A comprehensive performance evaluation using the TPC-H benchmark is presented at the end. The test results show that the system is over 10 times faster than Hive on simple select queries and over 40 times faster on complex join queries.
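The iterator execution model mentioned above can be illustrated with a minimal sketch. This is the classic pull-based operator pattern in its general form, not the thesis's extended executor or PostgreSQL's actual code; the class names (`SeqScan`, `Filter`, `Project`), the `run` helper, and the sample rows are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]  # hypothetical row representation for this sketch


class Operator:
    """Base class: every operator exposes open / next / close."""

    def open(self) -> None:
        raise NotImplementedError

    def next(self) -> Optional[Row]:
        raise NotImplementedError

    def close(self) -> None:
        raise NotImplementedError


class SeqScan(Operator):
    """Leaf operator: returns the rows of an in-memory table one at a time."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.pos = 0

    def open(self) -> None:
        self.pos = 0

    def next(self) -> Optional[Row]:
        if self.pos >= len(self.rows):
            return None  # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self) -> None:
        pass


class Filter(Operator):
    """Pulls rows from its child and passes on those matching the predicate."""

    def __init__(self, child: Operator, predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Row]:
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row

    def close(self) -> None:
        self.child.close()


class Project(Operator):
    """Keeps only the requested columns of each row pulled from its child."""

    def __init__(self, child: Operator, columns: List[str]) -> None:
        self.child = child
        self.columns = columns

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Row]:
        row = self.child.next()
        if row is None:
            return None
        return {c: row[c] for c in self.columns}

    def close(self) -> None:
        self.child.close()


def run(plan: Operator) -> List[Row]:
    """Drives a plan by pulling rows from the root until it is exhausted."""
    plan.open()
    out: List[Row] = []
    while True:
        row = plan.next()
        if row is None:
            break
        out.append(row)
    plan.close()
    return out


# Roughly: SELECT id FROM t WHERE price > 10
table = [{"id": 1, "price": 5}, {"id": 2, "price": 50}, {"id": 3, "price": 70}]
plan = Project(Filter(SeqScan(table), lambda r: r["price"] > 10), ["id"])
result = run(plan)
```

Each operator only asks its child for the next row, so operators compose into a tree without knowing what sits below them. A parallel engine would additionally need operators that exchange rows between nodes, which this single-process sketch does not attempt to show.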
Keywords/Search Tags: Data Warehouse Engine, Query Processing, Parallel Computing