
Cost-based MapReduce Workflow Optimization

Posted on: 2014-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: P Huang
Full Text: PDF
GTID: 2248330395967820
Subject: Computer Science and Technology
Abstract/Summary:
With the growth of the Internet, large-scale data analysis has become critical to the success of modern enterprises. Meanwhile, with the emergence of cloud computing, companies are attracted to moving their data analytics tasks to the cloud by its flexible, on-demand resource usage and pay-as-you-go pricing model. MapReduce has been widely recognized as an important tool for performing large-scale data analysis in the cloud: it provides a simple, fault-tolerant framework for processing data-intensive analytics tasks in parallel across many physical machines.

In this paper, we give an introduction to the following: (1) MapReduce and its different implementations; (2) some performance metrics and the factors that affect them; (3) a performance model based on I/O cost; (4) MapReduce workflows and why they are needed; (5) some state-of-the-art MapReduce workflow engines and a brief comparison among them.

We describe the design and architecture of Crunch, an open-source MapReduce workflow engine, in detail and show how it works. We then present an algorithm based on the I/O cost model to optimize Crunch. To verify the performance of the improved Crunch, we implement a parallel recommendation algorithm, run it with both the original and the improved Crunch on different datasets, and compare their running times. The experimental results show that the improved Crunch outperforms the original.
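To make the idea of an I/O-based cost model concrete, the following is a minimal sketch of how the total I/O of a single MapReduce job might be estimated. The function name, parameters, and cost formula here are illustrative assumptions for exposition only, not the actual model developed in the thesis.

```python
# Hypothetical sketch: estimate the I/O cost of one MapReduce job as the
# sum of bytes read and written at each stage. The stage breakdown and
# the selectivity parameters are illustrative assumptions.

def mapreduce_io_cost(input_bytes, map_selectivity, reduce_selectivity,
                      replication=3):
    """Estimate total bytes moved by one MapReduce job.

    input_bytes:        size of the job's input on the distributed file system
    map_selectivity:    ratio of map output size to map input size
    reduce_selectivity: ratio of reduce output size to reduce input size
    replication:        replication factor applied to the final output
    """
    map_read = input_bytes                      # mappers read the input
    map_write = input_bytes * map_selectivity   # map output spilled to disk
    shuffle = map_write                         # reducers pull the map output
    final_write = shuffle * reduce_selectivity * replication  # replicated output
    return map_read + map_write + shuffle + final_write


# A workflow's cost is the sum over its jobs; an optimizer that merges two
# adjacent jobs avoids materializing the intermediate result, lowering cost.
cost = mapreduce_io_cost(1_000_000, map_selectivity=0.5, reduce_selectivity=0.2)
```

Under a model of this shape, a cost-based workflow optimizer compares the estimated I/O of alternative job plans and picks the cheapest one.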
Keywords/Search Tags:Cloud Computing, MapReduce, Hadoop, Workflow, Crunch