
Research On Iterative Distributed Data Processing Based On MapReduce

Posted on: 2014-01-05    Degree: Master    Type: Thesis
Country: China    Candidate: X J Feng    Full Text: PDF
GTID: 2248330398959177    Subject: Communication and Information System
Abstract/Summary:
The information era is an era of data. With the sharp increase in data volume across many domains, the requirements of data processing have far exceeded the capabilities of individual computers, making parallel and distributed computing strategies necessary. Traditional parallel programming technologies such as MPI and grid computing are complex, scale poorly, and cannot meet the growing requirements of large-scale data processing, so the demand for new programming models for large-scale data processing, such as MapReduce, has become increasingly urgent.

MapReduce was first developed by Google as a programming framework for processing highly distributable problems over large-scale datasets using large numbers of computers. Because of its simplicity, excellent fault tolerance, and scalability, MapReduce greatly simplifies the parallel and distributed processing of massive data on a cluster; it has attracted a great deal of related research since its introduction and is now applied in more and more scenarios.

However, existing implementations of the traditional MapReduce framework, such as Hadoop and Sphere, do not yet support iterative data processing efficiently, even though iterative computation is a very important class of application: a great many algorithms in scientific computing, data mining, information retrieval, machine learning, and other fields require multiple iterations. Improving the efficiency of iterative data processing based on MapReduce has therefore become an urgent research subject with great practical value. To address this problem, we study the issue in depth and present myHadoop, a modified version of Hadoop based on the MapReduce framework.

By modifying the programming model and the task scheduler, adopting a new task-parallel strategy, and adding a new loop control module as well as a new cache module, myHadoop not only extends MapReduce's programming support for iterative data processing but also greatly improves its efficiency. In this paper, we study how MapReduce handles iterative algorithms and analyze the existing problems. We then describe the design and implementation of myHadoop in detail. Finally, we design experimental applications and evaluate the performance of myHadoop against Hadoop. We also discuss the number of linked mappers to which a reducer streams its intermediate output on myHadoop, and the issue of non-iterative data processing.
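For context, the inefficiency that motivates myHadoop can be seen in how iterations are usually expressed on stock Hadoop: each iteration is submitted as a separate MapReduce job, and the HDFS output of one iteration becomes the input of the next, so every iteration pays full job-startup and HDFS read/write costs. The following is a minimal, hypothetical driver sketch of that baseline pattern using the standard Hadoop Java API; the identity Mapper and Reducer stand in for a real algorithm's map and reduce steps, and the paths and iteration count are illustrative. It is not code from myHadoop itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);                  // initial input directory on HDFS
        String workDir = args[1];                        // directory for per-iteration outputs
        int maxIterations = Integer.parseInt(args[2]);   // fixed iteration count for simplicity

        for (int i = 1; i <= maxIterations; i++) {
            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // Identity map/reduce steps; a real iterative algorithm would supply its own classes.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            Path output = new Path(workDir + "/iter-" + i);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            // The output of this iteration is re-read from HDFS as the next iteration's input:
            // the repeated job setup and HDFS I/O is exactly the overhead an iterative-aware
            // framework tries to avoid, e.g. through loop control and caching of loop-invariant data.
            input = output;
        }
    }
}

In this baseline, loop control lives entirely in the client-side driver, and data that does not change between iterations is still reloaded every time; a loop control module and a cache module inside the framework, as described above for myHadoop, target precisely these two costs.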
Keywords/Search Tags:MapReduce, distributed, iterative, Hadoop, myHadoop