
Implementation Of Parallel ETL Components Based On MapReduce Job's Split And Combination Mechanism

Posted on: 2015-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: X G Liu
Full Text: PDF
GTID: 2298330467463099
Subject: Computer Science and Technology
Abstract/Summary:
Analyzing large amounts of data is becoming increasingly important in this fast-moving information age. The volume of data being collected and analyzed is growing rapidly, which makes traditional solutions prohibitively expensive. Faced with such massive data, companies want to apply data mining algorithms to discover latent information of enormous commercial value. ETL (Extract-Transform-Load) is the prerequisite of data mining: it delivers cleaner, more concise data and makes data mining more convenient. Processing big data with ETL algorithms therefore has important practical significance.

MapReduce and its open source implementation Hadoop provide an efficient way to process big data and are widely recognized as effective tools for large-scale data analysis. Hadoop-based ETL tools are in great demand. Complex ETL scenarios may contain many operations, and each operation needs its own MapReduce job; the I/O operations and network transfers between these jobs slow down the whole process. To cope with this problem, Hadoop provides a chain-MapReduce framework that can merge several MapReduce jobs into one big job, but this framework still has shortcomings. After studying existing ETL tools on big data platforms and methods of processing big data, this paper presents an improved chain-MapReduce framework and applies it to an ETL tool based on the Hadoop platform. The tool works in B/S (browser/server) mode and builds a data flow by dragging and dropping algorithm components.

The contents of this paper are as follows:
1. Studying existing ETL tools on big data platforms, methods of processing big data, and the execution of MapReduce jobs, in preparation for designing the optimization rules and the improved chain-MapReduce framework.
2. Presenting an improved chain-MapReduce framework based on several open source ETL tools and the characteristics of MapReduce, applying it to a parallel ETL tool, and designing a new workflow engine accordingly (a sketch of the standard chain API that this work extends is given below).
3. Designing several process-level optimization rules based on the characteristics of MapReduce and of the ETL algorithms. These rules make an ETL flow generate fewer MapReduce jobs and incur less I/O and disk cost. The paper also improves some of the ETL algorithms themselves.
4. Running a performance comparison between this ETL tool and Hive on real queries and real big datasets; the data consists of one province's mobile-phone internet access logs.
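As background (not taken from the thesis itself), the following minimal sketch shows how Hadoop's existing chain-MapReduce API (ChainMapper/ChainReducer) composes several map steps and one reduce step into a single job; this is the mechanism the paper's improved framework builds on. All class names (ChainEtlJob, FilterMapper, CleanMapper, ForwardReducer) and the sample logic are hypothetical placeholders.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainEtlJob {

        // Hypothetical first map step: drop empty log lines.
        public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                if (value.getLength() > 0) {
                    ctx.write(new Text(key.toString()), value);
                }
            }
        }

        // Hypothetical second map step: runs on the first mapper's output inside the
        // same map task, so no intermediate HDFS write or shuffle happens between them.
        public static class CleanMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(key, new Text(value.toString().trim()));
            }
        }

        // Hypothetical reduce step: simply forwards the cleaned records.
        public static class ForwardReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                for (Text v : values) {
                    ctx.write(key, v);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "chained ETL job");
            job.setJarByClass(ChainEtlJob.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Chain two map steps and one reduce step into a single MapReduce job,
            // instead of running three separate jobs with HDFS writes in between.
            ChainMapper.addMapper(job, FilterMapper.class,
                    LongWritable.class, Text.class, Text.class, Text.class,
                    new Configuration(false));
            ChainMapper.addMapper(job, CleanMapper.class,
                    Text.class, Text.class, Text.class, Text.class,
                    new Configuration(false));
            ChainReducer.setReducer(job, ForwardReducer.class,
                    Text.class, Text.class, Text.class, Text.class,
                    new Configuration(false));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The standard API only allows the pattern "one or more mappers, at most one reducer, then more mappers" inside a single job; the limitations of this shape are among the shortcomings the thesis's improved framework and optimization rules address.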
Keywords/Search Tags:chain, MapReduce, ETL, optimization rules