
Research And Implementation Of Distributed ETL Based On Hadoop Platform

Posted on: 2015-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: G He
Full Text: PDF
GTID: 2268330425482040
Subject: Computer application technology
Abstract/Summary:
Data extraction, transformation, and loading (ETL) is a key technique for building a high-quality data warehouse, and the core technique for supplying valid data to high-level decision makers. Loading massive data into the data warehouse rapidly through ETL is a problem in urgent need of a solution and a topic of common concern in the data warehouse field.

This thesis draws on data warehouse theory and distributed techniques for processing massive data, focusing on a distributed ETL framework, the parallel processing of data, and an optimization approach for HDFS data block assignment. The main research and contributions are as follows:

Firstly, the design of a distributed ETL framework. The MapReduce working mechanism and job scheduling on the Hadoop platform are analyzed, and, building on the theory of dimensional modeling for data warehouses, a distributed ETL framework is designed that covers the parallel processing of dimensions and facts as well as the assignment of HDFS data blocks.

Secondly, research on the parallel processing of fact data. From the two perspectives of surrogate-key lookup for fact tables and aggregation of fact data at different granularities, a multi-way parallel lookup algorithm over slowly changing dimensions and an algorithm for aggregating fact data at different granularities are presented (illustrative sketches of both follow this abstract). Experimental results show that both algorithms handle the parallel processing of fact data more efficiently than the Hive data warehouse.

Thirdly, research on an HDFS data block assignment algorithm. An algorithm that assigns HDFS data blocks to the distributed data warehouse is presented, based on the minimum-cost maximum-flow theory of network flows; an improved shortest-augmenting-path method is applied to solve for the maximum flow, with network distance and node load balance treated as the network cost (a sketch of the flow computation also follows below). Experiments indicate that this assignment algorithm is considerably more efficient than existing methods.

Finally, the implementation procedure of the distributed ETL system on the Hadoop platform is given. The resulting system is more efficient than the existing methods.
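To make the fact-processing step concrete, here is a minimal Java MapReduce sketch of the surrogate-key lookup against a type-2 slowly changing dimension, assuming the dimension extract is replicated to each task through Hadoop's distributed cache and probed in memory for every fact row. The file layouts, class names, and CSV fields are illustrative assumptions, not the thesis's implementation.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScdLookupMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    /** One historical version of a dimension member (type-2 SCD). */
    private static class DimVersion {
        final String surrogateKey, validFrom, validTo; // ISO dates sort lexicographically
        DimVersion(String sk, String from, String to) {
            surrogateKey = sk; validFrom = from; validTo = to;
        }
    }

    // businessKey -> all historical versions of that member.
    private final Map<String, List<DimVersion>> dim = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // Assumes the dimension extract was shipped with job.addCacheFile(...)
        // as CSV lines "businessKey,surrogateKey,validFrom,validTo" and that
        // the usual working-directory symlink for cache files is in place.
        if (ctx.getCacheFiles() == null) return;
        for (URI cached : ctx.getCacheFiles()) {
            try (BufferedReader in = new BufferedReader(
                    new FileReader(new File(cached.getPath()).getName()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",");
                    dim.computeIfAbsent(f[0], k -> new ArrayList<>())
                       .add(new DimVersion(f[1], f[2], f[3]));
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Assumed fact layout: businessKey,transactionDate,measures...
        String[] f = line.toString().split(",", 3);
        String businessKey = f[0], date = f[1];
        for (DimVersion v : dim.getOrDefault(businessKey, Collections.emptyList())) {
            // Pick the version whose validity interval covers the fact date.
            if (v.validFrom.compareTo(date) <= 0 && date.compareTo(v.validTo) < 0) {
                ctx.write(NullWritable.get(), new Text(v.surrogateKey + "," + f[2]));
                return;
            }
        }
        // Unmatched rows could be routed to a late-arriving-dimension handler.
    }
}
```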
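A companion sketch for the multi-granularity aggregation: the mapper re-emits each fact row once per date granularity (day, month, and year here, an assumed set), and a single reducer, reusable as a combiner since summation is associative, totals the measure per granularity key. Field positions and the key encoding are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MultiGranularityAgg {

    public static class FactMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed fact layout: surrogateKey,yyyy-MM-dd,measure
            String[] f = line.toString().split(",");
            String date = f[1];
            LongWritable measure = new LongWritable(Long.parseLong(f[2]));
            // One output record per granularity level.
            ctx.write(new Text("day:" + date), measure);
            ctx.write(new Text("month:" + date.substring(0, 7)), measure);
            ctx.write(new Text("year:" + date.substring(0, 4)), measure);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            ctx.write(key, new LongWritable(sum)); // one total per (granularity, period)
        }
    }
}
```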
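As an illustration of the third contribution, the sketch below implements generic minimum-cost maximum-flow using SPFA-based (queue-based Bellman-Ford) shortest augmenting paths, the textbook form of the shortest-augmenting-path idea the abstract names. How the assignment graph is built is an assumption for illustration: a source with one edge per HDFS block, block-to-datanode edges whose cost encodes network distance plus a load-balance penalty, and datanode-to-sink edges capped by remaining capacity. Class and method names are hypothetical; this is not the thesis's code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Min-cost max-flow via successive shortest augmenting paths (SPFA). */
public class BlockAssignmentFlow {
    private final int n;
    private final List<int[]> edges = new ArrayList<>();      // {to, residualCap, cost}
    private final List<List<Integer>> graph = new ArrayList<>(); // node -> edge ids

    public BlockAssignmentFlow(int nodeCount) {
        n = nodeCount;
        for (int i = 0; i < n; i++) graph.add(new ArrayList<>());
    }

    /** Adds a directed edge and its zero-capacity reverse (ids paired as 2k, 2k+1). */
    public void addEdge(int u, int v, int cap, int cost) {
        graph.get(u).add(edges.size()); edges.add(new int[]{v, cap, cost});
        graph.get(v).add(edges.size()); edges.add(new int[]{u, 0, -cost});
    }

    /** Returns {maxFlow, minCost} from s to t. */
    public int[] minCostMaxFlow(int s, int t) {
        int flow = 0, cost = 0;
        while (true) {
            // SPFA: cheapest augmenting path in the residual graph.
            int[] dist = new int[n], prevEdge = new int[n];
            boolean[] inQueue = new boolean[n];
            Arrays.fill(dist, Integer.MAX_VALUE);
            Arrays.fill(prevEdge, -1);
            dist[s] = 0;
            Deque<Integer> q = new ArrayDeque<>();
            q.add(s); inQueue[s] = true;
            while (!q.isEmpty()) {
                int u = q.poll(); inQueue[u] = false;
                for (int id : graph.get(u)) {
                    int[] e = edges.get(id);
                    if (e[1] > 0 && dist[u] + e[2] < dist[e[0]]) {
                        dist[e[0]] = dist[u] + e[2];
                        prevEdge[e[0]] = id;
                        if (!inQueue[e[0]]) { q.add(e[0]); inQueue[e[0]] = true; }
                    }
                }
            }
            if (prevEdge[t] == -1) break; // no augmenting path remains

            // Bottleneck residual capacity along the path, walking t back to s.
            int push = Integer.MAX_VALUE;
            for (int v = t; v != s; v = edges.get(prevEdge[v] ^ 1)[0])
                push = Math.min(push, edges.get(prevEdge[v])[1]);
            // Apply the augmentation through both edge directions.
            for (int v = t; v != s; v = edges.get(prevEdge[v] ^ 1)[0]) {
                edges.get(prevEdge[v])[1] -= push;
                edges.get(prevEdge[v] ^ 1)[1] += push;
            }
            flow += push;
            cost += push * dist[t];
        }
        return new int[]{flow, cost};
    }
}
```

Under the assumed graph construction, the resulting flow saturates one block-to-datanode edge per block copy, and the minimum cost jointly favors nearby and lightly loaded nodes, which is the trade-off the abstract describes.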
Keywords/Search Tags: Hadoop, distributed ETL, data processing, assignment algorithm