
Research On Iterative Distributed Data Processing Based On MapReduce

Posted on: 2014-01-05    Degree: Master    Type: Thesis
Country: China    Candidate: X J Feng    Full Text: PDF
GTID: 2248330398959177    Subject: Communication and Information System
Abstract/Summary:
The information era is an era of data. With the sharp increase in data volume across many domains, the requirements of data processing have far exceeded the capabilities of individual computers, making parallel and distributed computing strategies necessary. Traditional parallel programming technologies such as MPI and grid computing are complex, scale poorly, and cannot meet the growing requirements of large-scale data processing, so the demand for new programming models for large-scale data processing, such as MapReduce, has become increasingly urgent.

MapReduce was first developed by Google as a programming framework for processing highly distributable problems over large-scale datasets using large numbers of computers. Because of its simplicity, excellent fault tolerance, and scalability, MapReduce greatly simplifies the parallel and distributed processing of massive data on a cluster; it has attracted a great deal of related research since its introduction and is now applied in more and more scenarios.

However, existing implementations of the traditional MapReduce framework, such as Hadoop and Sphere, do not yet support iterative data processing efficiently, even though iterative computation is a very important class of application: a great many algorithms in scientific computing, data mining, information retrieval, machine learning, and other fields require multiple iterations. Improving the efficiency of iterative data processing based on MapReduce has therefore become an urgent research subject with great practical value. To address this problem, we study the issue in depth and present myHadoop, a modified version of Hadoop based on the MapReduce framework.

By modifying the programming model and the task scheduler, adopting a new task-parallel strategy, and adding a new loop control module as well as a new cache module, myHadoop not only extends MapReduce's programming support for iterative data processing but also greatly improves its efficiency. In this paper, we study how MapReduce handles iterative algorithms and analyze the existing problems. We then describe the design and implementation of myHadoop in detail. Finally, we design experimental applications and evaluate the performance of myHadoop against Hadoop. We also discuss the number of linked mappers to which a reducer streams its intermediate output on myHadoop, and the issue of non-iterative data processing.
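For context, the inefficiency that motivates myHadoop can be seen in how iterations are usually expressed on stock Hadoop: each iteration is submitted as a separate MapReduce job, and the HDFS output of one iteration becomes the input of the next, so every iteration pays full job-startup and HDFS read/write costs. The following is a minimal, hypothetical driver sketch of that baseline pattern using the standard Hadoop Java API; the identity Mapper and Reducer stand in for a real algorithm's map and reduce steps, and the paths and iteration count are illustrative. It is not code from myHadoop itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);                  // initial input directory on HDFS
        String workDir = args[1];                        // directory for per-iteration outputs
        int maxIterations = Integer.parseInt(args[2]);   // fixed iteration count for simplicity

        for (int i = 1; i <= maxIterations; i++) {
            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // Identity map/reduce steps; a real iterative algorithm would supply its own classes.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            Path output = new Path(workDir + "/iter-" + i);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            // The output of this iteration is re-read from HDFS as the next iteration's input:
            // the repeated job setup and HDFS I/O is exactly the overhead an iterative-aware
            // framework tries to avoid, e.g. through loop control and caching of loop-invariant data.
            input = output;
        }
    }
}

In this baseline, loop control lives entirely in the client-side driver, and data that does not change between iterations is still reloaded every time; a loop control module and a cache module inside the framework, as described above for myHadoop, target precisely these two costs.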
Keywords/Search Tags:MapReduce, distributed, iterative, Hadoop, myHadoop