
Cost-based MapReduce Workflow Optimization

Posted on: 2014-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: P Huang
Full Text: PDF
GTID: 2248330395967820
Subject: Computer Science and Technology
Abstract/Summary:
With the growth of the Internet, large-scale data analysis has become critical to the success of modern enterprises. Meanwhile, with the emergence of cloud computing, companies are attracted to moving their data analytics tasks to the cloud by its flexible, on-demand resource usage and pay-as-you-go pricing model. MapReduce has been widely recognized as an important tool for performing large-scale data analysis in the cloud: it provides a simple, fault-tolerant framework for processing data-intensive analytics tasks in parallel across many physical machines.

In this paper, we give an introduction to the following: (1) MapReduce and its different implementations; (2) some performance metrics and the factors that affect them; (3) a performance model based on I/O cost; (4) MapReduce workflows and why they are needed; (5) some state-of-the-art MapReduce workflow engines and a brief comparison among them.

We describe the design and architecture of Crunch, an open-source MapReduce workflow engine, in detail and show how it works. We then present an algorithm based on the I/O cost model to optimize Crunch. To verify the performance of the improved Crunch, we implement a parallel recommendation algorithm, run it with both the original and the improved Crunch on different datasets, and compare their running times. The experimental results show that the improved Crunch outperforms the original.
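To make the idea of an I/O-based cost model concrete, the following is a minimal sketch of how the total I/O of a single MapReduce job might be estimated. The function name, parameters, and cost formula here are illustrative assumptions for exposition only, not the actual model developed in the thesis.

```python
# Hypothetical sketch: estimate the I/O cost of one MapReduce job as the
# sum of bytes read and written at each stage. The stage breakdown and
# the selectivity parameters are illustrative assumptions.

def mapreduce_io_cost(input_bytes, map_selectivity, reduce_selectivity,
                      replication=3):
    """Estimate total bytes moved by one MapReduce job.

    input_bytes:        size of the job's input on the distributed file system
    map_selectivity:    ratio of map output size to map input size
    reduce_selectivity: ratio of reduce output size to reduce input size
    replication:        replication factor applied to the final output
    """
    map_read = input_bytes                      # mappers read the input
    map_write = input_bytes * map_selectivity   # map output spilled to disk
    shuffle = map_write                         # reducers pull the map output
    final_write = shuffle * reduce_selectivity * replication  # replicated output
    return map_read + map_write + shuffle + final_write


# A workflow's cost is the sum over its jobs; an optimizer that merges two
# adjacent jobs avoids materializing the intermediate result, lowering cost.
cost = mapreduce_io_cost(1_000_000, map_selectivity=0.5, reduce_selectivity=0.2)
```

Under a model of this shape, a cost-based workflow optimizer compares the estimated I/O of alternative job plans and picks the cheapest one.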
Keywords/Search Tags:Cloud Computing, MapReduce, Hadoop, Workflow, Crunch