The Design And Implementation Of Parallel Computing Platform Based On MapReduce

Posted on:2009-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:Z Z Wan

Full Text:PDF

GTID:2178360242983014

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet, the volume of data produced on the web is growing quickly. Website operators face an awkward "rich data, little knowledge" situation when confronted with massive data sets. To mine the knowledge hidden in that data, they need a general-purpose, extensible platform that processes massive data effectively.

MapReduce is a simple and flexible parallel programming model proposed by Google for large-scale data processing in a distributed computing environment. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model.

Based on an analysis of Google's MapReduce and on our own requirements, we explore a more general and extensible platform for mass data management.

First, we propose a framework architecture for parallel processing of mass data, structured as client, scheduler and processor, and data storage. At the client end, users submit tasks through a configurable XML document.

In designing the task scheduling and execution layer, we propose several key strategies: a common platform strategy, a load balancing strategy, an intermediate results management strategy, and a fault-tolerance strategy. Following these strategies, we adopt a master-dispatch-service framework. The master node collects the status of each node; the dispatch node divides the task set into task units, dispatches them to service nodes, and finally collects the results; the service nodes perform the actual data processing.

Next, we design a distributed file system to handle mass data storage. The system offers good performance and throughput, along with strong stability and robustness.

Finally, we carry out runtime performance tests on the platform. Comparing the results of stand-alone and parallel data processing, we conclude that parallel processing is more effective. By running different tasks, we explore how to achieve optimal performance within the limits of the cluster scale.
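
The MapReduce model and the division of a task set into task units, as described above, can be illustrated with a minimal single-process sketch. It is not the platform's implementation: the function names (`map_word_count`, `reduce_word_count`, `split_into_task_units`, `run_map_reduce`), the word-count task, and the `unit_size` parameter are all assumptions made for illustration, and the sketch runs the units sequentially instead of dispatching them to service nodes.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator

Record = tuple[str, str]  # (key, value), e.g. (document name, document text)


def map_word_count(key: str, value: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for each word in one record."""
    for word in value.lower().split():
        yield word, 1


def reduce_word_count(key: str, values: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share one intermediate key."""
    return sum(values)


def split_into_task_units(records: list[Record], unit_size: int) -> list[list[Record]]:
    """Divide the task set into task units (hypothetical stand-in for a dispatch step)."""
    return [records[i:i + unit_size] for i in range(0, len(records), unit_size)]


def run_map_reduce(
    records: list[Record],
    map_fn: Callable[[str, str], Iterator[tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
    unit_size: int = 2,
) -> dict[str, int]:
    """Run map over each task unit, group by intermediate key, then reduce."""
    intermediate: dict[str, list[int]] = defaultdict(list)
    # In a real cluster each unit would go to a different worker;
    # here the units are processed one after another.
    for unit in split_into_task_units(records, unit_size):
        for key, value in unit:
            for ikey, ivalue in map_fn(key, value):
                intermediate[ikey].append(ivalue)
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in intermediate.items()}


records = [
    ("doc1", "mass data parallel"),
    ("doc2", "parallel computing"),
    ("doc3", "mass data"),
]
counts = run_map_reduce(records, map_word_count, reduce_word_count)
print(counts)
```

Grouping intermediate values by key between the map and reduce phases is the step that lets each reduce call work independently, which is what makes the model straightforward to parallelize.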

Keywords/Search Tags:

mass data, MapReduce, parallel computing, distributed file system, load balancing, fault-tolerance, cluster

Related items

1	Distributed Database Cluster System Zd-ddb Design And Implementation
2	Load Balancing Problems For Parallel And Distributed Computing
3	A virtual distributed computing system
4	Design And Implementation Of Digital Organisms Traffic Scheduling System Load Balancing And Fault Tolerance Mechanisms
5	The Design And Implementation Of A MapReduce Computing Framework Based On GPU Cluster
6	The Principle And Design Of Distributed Computing Platform Based On MapReduce
7	Runtime systems for load balancing and fault tolerance on distributed systems
8	Design And Implementation Of Fault Tolerance Technology For Distributed System
9	Distributed File System Zd-dfs Design And Implementation
10	Study On .NET Framework Based Distributed Parallel Computing System