The Design And Implementation Of Parallel Computing Platform Based On MapReduce

Posted on: 2009-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Wan
Full Text: PDF
GTID: 2178360242983014
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of data produced on the web is growing quickly. Website operators find themselves in an awkward "rich data, little knowledge" situation when faced with massive data sets. To mine the knowledge latent in this data, they need a general-purpose, extensible platform that processes massive data effectively.

MapReduce is a simple and flexible parallel programming model proposed by Google for large-scale data processing in a distributed computing environment. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model.

Based on an analysis of Google's MapReduce and on our own circumstances, we explore a more general and extensible platform for mass data management.

First, we propose a framework architecture for parallel processing of mass data, structured as client, scheduler and processor, and data storage. At the client end, users submit tasks through a configurable XML document.

In designing the task scheduling and execution layer, we propose several key strategies: a common platform strategy, a load balancing strategy, an intermediate results management strategy, and a fault-tolerance strategy. Following these strategies, we adopt a master-dispatch-service framework. The master node collects the status of each node; the dispatch node divides the task set into task units, dispatches them to the service nodes, and finally gathers the results; the service nodes carry out the actual data processing.

Next, we design a distributed file system to handle mass data storage. The system offers good performance and throughput, along with strong stability and robustness.

Finally, we carry out runtime performance tests on the platform. Comparing the results of stand-alone and parallel data processing, we conclude that parallel processing is more efficient. By executing different tasks, we explore how to achieve optimal performance within the limits of the cluster scale.
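
As an illustration of the programming model the abstract describes, the following is a minimal single-process sketch of a MapReduce word count. It is not the thesis's implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, each input record is assumed to be one task unit, and a thread pool merely stands in for the service nodes that a dispatch node would hand task units to.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(key, value):
    """Map: (document name, text) -> list of intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one intermediate key."""
    return key, sum(values)


def run_mapreduce(records, workers=2):
    """Hypothetical driver playing the role of the dispatch node (assumption)."""
    # Treat each input record as one task unit; threads stand in for
    # service nodes and run the map function in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda kv: map_fn(*kv), records))

    # Shuffle: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for pairs in mapped:
        for k, v in pairs:
            groups[k].append(v)

    # Reduce each group independently and gather the results.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


docs = [("a.txt", "map reduce map"), ("b.txt", "reduce cluster")]
counts = run_mapreduce(docs)
```

In a real cluster the grouping step would move data between machines, and the driver would also need the load balancing and fault-tolerance strategies the abstract mentions; the sketch leaves both out.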
Keywords/Search Tags: mass data, MapReduce, parallel computing, distributed file system, load balancing, fault-tolerance, cluster