
Implementation Of Parallel ETL Components Based On MapReduce Job's Split And Combination Mechanism

Posted on: 2015-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: X G Liu
Full Text: PDF
GTID: 2298330467463099
Subject: Computer Science and Technology
Abstract/Summary:
Analyzing large amounts of data is becoming increasingly important in this fast-moving information age. The volume of data being collected and analyzed is growing rapidly, which makes traditional solutions prohibitively expensive. Faced with such massive data, companies want to apply data mining algorithms to discover latent information of enormous commercial value. ETL (Extract-Transform-Load) is the prerequisite of data mining: it delivers cleaner, more concise data and makes data mining more convenient. Processing big data with ETL algorithms therefore has important practical significance.

MapReduce and its open source implementation Hadoop provide an efficient way to process big data and are widely recognized as effective tools for large-scale data analysis. Hadoop-based ETL tools are in great demand. Complex ETL scenarios may contain many operations, and each operation needs its own MapReduce job; the I/O operations and network transfers between these jobs slow down the whole process. To cope with this problem, Hadoop provides a chain-MapReduce framework that can merge several MapReduce jobs into one big job, but this framework still has shortcomings. After studying existing ETL tools on big data platforms and methods of processing big data, this paper presents an improved chain-MapReduce framework and applies it to an ETL tool based on the Hadoop platform. The tool works in B/S (browser/server) mode and builds a data flow by dragging and dropping algorithm components.

The contents of this paper are as follows:
1. Studying existing ETL tools on big data platforms, methods of processing big data, and the execution of MapReduce jobs, in preparation for designing the optimization rules and the improved chain-MapReduce framework.
2. Presenting an improved chain-MapReduce framework based on several open source ETL tools and the characteristics of MapReduce, applying it to a parallel ETL tool, and designing a new workflow engine accordingly (a sketch of the standard chain API that this work extends is given below).
3. Designing several process-level optimization rules based on the characteristics of MapReduce and of the ETL algorithms. These rules make an ETL flow generate fewer MapReduce jobs and incur less I/O and disk cost. The paper also improves some of the ETL algorithms themselves.
4. Running a performance comparison between this ETL tool and Hive on real queries and real big datasets; the data consists of one province's mobile-phone internet access logs.
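As background (not taken from the thesis itself), the following minimal sketch shows how Hadoop's existing chain-MapReduce API (ChainMapper/ChainReducer) composes several map steps and one reduce step into a single job; this is the mechanism the paper's improved framework builds on. All class names (ChainEtlJob, FilterMapper, CleanMapper, ForwardReducer) and the sample logic are hypothetical placeholders.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainEtlJob {

        // Hypothetical first map step: drop empty log lines.
        public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                if (value.getLength() > 0) {
                    ctx.write(new Text(key.toString()), value);
                }
            }
        }

        // Hypothetical second map step: runs on the first mapper's output inside the
        // same map task, so no intermediate HDFS write or shuffle happens between them.
        public static class CleanMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(key, new Text(value.toString().trim()));
            }
        }

        // Hypothetical reduce step: simply forwards the cleaned records.
        public static class ForwardReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                for (Text v : values) {
                    ctx.write(key, v);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "chained ETL job");
            job.setJarByClass(ChainEtlJob.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Chain two map steps and one reduce step into a single MapReduce job,
            // instead of running three separate jobs with HDFS writes in between.
            ChainMapper.addMapper(job, FilterMapper.class,
                    LongWritable.class, Text.class, Text.class, Text.class,
                    new Configuration(false));
            ChainMapper.addMapper(job, CleanMapper.class,
                    Text.class, Text.class, Text.class, Text.class,
                    new Configuration(false));
            ChainReducer.setReducer(job, ForwardReducer.class,
                    Text.class, Text.class, Text.class, Text.class,
                    new Configuration(false));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The standard API only allows the pattern "one or more mappers, at most one reducer, then more mappers" inside a single job; the limitations of this shape are among the shortcomings the thesis's improved framework and optimization rules address.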
Keywords/Search Tags:chain, MapReduce, ETL, optimization rules