
Research And Implementation Of Distributed ETL Based On Hadoop Platform

Posted on: 2015-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: G He
Full Text: PDF
GTID: 2268330425482040
Subject: Computer application technology
Abstract/Summary:
Data extraction, transformation, and loading (ETL) is a key technique for building a high-quality data warehouse, and the core technique for supplying valid data to high-level decision makers. Loading massive data into the data warehouse rapidly through ETL is a problem in urgent need of a solution and a topic of common concern in the data warehouse field.

This thesis draws on data warehouse theory and distributed techniques for processing massive data, focusing on a distributed ETL framework, the parallel processing of data, and an optimization approach for HDFS data block assignment. The main research and contributions are as follows:

Firstly, the design of a distributed ETL framework. The MapReduce working mechanism and job scheduling on the Hadoop platform are analyzed, and, building on the theory of dimensional modeling for data warehouses, a distributed ETL framework is designed that covers the parallel processing of dimensions and facts as well as the assignment of HDFS data blocks.

Secondly, research on the parallel processing of fact data. From the two perspectives of surrogate-key lookup for fact tables and aggregation of fact data at different granularities, a multi-way parallel lookup algorithm over slowly changing dimensions and an algorithm for aggregating fact data at different granularities are presented (illustrative sketches of both follow this abstract). Experimental results show that both algorithms handle the parallel processing of fact data more efficiently than the Hive data warehouse.

Thirdly, research on an HDFS data block assignment algorithm. An algorithm that assigns HDFS data blocks to the distributed data warehouse is presented, based on the minimum-cost maximum-flow theory of network flows; an improved shortest-augmenting-path method is applied to solve for the maximum flow, with network distance and node load balance treated as the network cost (a sketch of the flow computation also follows below). Experiments indicate that this assignment algorithm is considerably more efficient than existing methods.

Finally, the implementation procedure of the distributed ETL system on the Hadoop platform is given. The resulting system is more efficient than the existing methods.
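To make the fact-processing step concrete, here is a minimal Java MapReduce sketch of the surrogate-key lookup against a type-2 slowly changing dimension, assuming the dimension extract is replicated to each task through Hadoop's distributed cache and probed in memory for every fact row. The file layouts, class names, and CSV fields are illustrative assumptions, not the thesis's implementation.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScdLookupMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    /** One historical version of a dimension member (type-2 SCD). */
    private static class DimVersion {
        final String surrogateKey, validFrom, validTo; // ISO dates sort lexicographically
        DimVersion(String sk, String from, String to) {
            surrogateKey = sk; validFrom = from; validTo = to;
        }
    }

    // businessKey -> all historical versions of that member.
    private final Map<String, List<DimVersion>> dim = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // Assumes the dimension extract was shipped with job.addCacheFile(...)
        // as CSV lines "businessKey,surrogateKey,validFrom,validTo" and that
        // the usual working-directory symlink for cache files is in place.
        if (ctx.getCacheFiles() == null) return;
        for (URI cached : ctx.getCacheFiles()) {
            try (BufferedReader in = new BufferedReader(
                    new FileReader(new File(cached.getPath()).getName()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",");
                    dim.computeIfAbsent(f[0], k -> new ArrayList<>())
                       .add(new DimVersion(f[1], f[2], f[3]));
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Assumed fact layout: businessKey,transactionDate,measures...
        String[] f = line.toString().split(",", 3);
        String businessKey = f[0], date = f[1];
        for (DimVersion v : dim.getOrDefault(businessKey, Collections.emptyList())) {
            // Pick the version whose validity interval covers the fact date.
            if (v.validFrom.compareTo(date) <= 0 && date.compareTo(v.validTo) < 0) {
                ctx.write(NullWritable.get(), new Text(v.surrogateKey + "," + f[2]));
                return;
            }
        }
        // Unmatched rows could be routed to a late-arriving-dimension handler.
    }
}
```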
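A companion sketch for the multi-granularity aggregation: the mapper re-emits each fact row once per date granularity (day, month, and year here, an assumed set), and a single reducer, reusable as a combiner since summation is associative, totals the measure per granularity key. Field positions and the key encoding are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MultiGranularityAgg {

    public static class FactMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed fact layout: surrogateKey,yyyy-MM-dd,measure
            String[] f = line.toString().split(",");
            String date = f[1];
            LongWritable measure = new LongWritable(Long.parseLong(f[2]));
            // One output record per granularity level.
            ctx.write(new Text("day:" + date), measure);
            ctx.write(new Text("month:" + date.substring(0, 7)), measure);
            ctx.write(new Text("year:" + date.substring(0, 4)), measure);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            ctx.write(key, new LongWritable(sum)); // one total per (granularity, period)
        }
    }
}
```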
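As an illustration of the third contribution, the sketch below implements generic minimum-cost maximum-flow using SPFA-based (queue-based Bellman-Ford) shortest augmenting paths, the textbook form of the shortest-augmenting-path idea the abstract names. How the assignment graph is built is an assumption for illustration: a source with one edge per HDFS block, block-to-datanode edges whose cost encodes network distance plus a load-balance penalty, and datanode-to-sink edges capped by remaining capacity. Class and method names are hypothetical; this is not the thesis's code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Min-cost max-flow via successive shortest augmenting paths (SPFA). */
public class BlockAssignmentFlow {
    private final int n;
    private final List<int[]> edges = new ArrayList<>();      // {to, residualCap, cost}
    private final List<List<Integer>> graph = new ArrayList<>(); // node -> edge ids

    public BlockAssignmentFlow(int nodeCount) {
        n = nodeCount;
        for (int i = 0; i < n; i++) graph.add(new ArrayList<>());
    }

    /** Adds a directed edge and its zero-capacity reverse (ids paired as 2k, 2k+1). */
    public void addEdge(int u, int v, int cap, int cost) {
        graph.get(u).add(edges.size()); edges.add(new int[]{v, cap, cost});
        graph.get(v).add(edges.size()); edges.add(new int[]{u, 0, -cost});
    }

    /** Returns {maxFlow, minCost} from s to t. */
    public int[] minCostMaxFlow(int s, int t) {
        int flow = 0, cost = 0;
        while (true) {
            // SPFA: cheapest augmenting path in the residual graph.
            int[] dist = new int[n], prevEdge = new int[n];
            boolean[] inQueue = new boolean[n];
            Arrays.fill(dist, Integer.MAX_VALUE);
            Arrays.fill(prevEdge, -1);
            dist[s] = 0;
            Deque<Integer> q = new ArrayDeque<>();
            q.add(s); inQueue[s] = true;
            while (!q.isEmpty()) {
                int u = q.poll(); inQueue[u] = false;
                for (int id : graph.get(u)) {
                    int[] e = edges.get(id);
                    if (e[1] > 0 && dist[u] + e[2] < dist[e[0]]) {
                        dist[e[0]] = dist[u] + e[2];
                        prevEdge[e[0]] = id;
                        if (!inQueue[e[0]]) { q.add(e[0]); inQueue[e[0]] = true; }
                    }
                }
            }
            if (prevEdge[t] == -1) break; // no augmenting path remains

            // Bottleneck residual capacity along the path, walking t back to s.
            int push = Integer.MAX_VALUE;
            for (int v = t; v != s; v = edges.get(prevEdge[v] ^ 1)[0])
                push = Math.min(push, edges.get(prevEdge[v])[1]);
            // Apply the augmentation through both edge directions.
            for (int v = t; v != s; v = edges.get(prevEdge[v] ^ 1)[0]) {
                edges.get(prevEdge[v])[1] -= push;
                edges.get(prevEdge[v] ^ 1)[1] += push;
            }
            flow += push;
            cost += push * dist[t];
        }
        return new int[]{flow, cost};
    }
}
```

Under the assumed graph construction, the resulting flow saturates one block-to-datanode edge per block copy, and the minimum cost jointly favors nearby and lightly loaded nodes, which is the trade-off the abstract describes.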
Keywords/Search Tags: Hadoop, distributed ETL, data processing, assignment algorithm