Design And Implementation Of System Failure Recovery Mechanism In Enterprise Level MapReduce System

Posted on:2013-02-18

Degree:Master

Type:Thesis

Country:China

Candidate:T D Lei

Full Text:PDF

GTID:2268330392469547

Subject:Software engineering

Abstract/Summary:


In 2004, Google introduced the MapReduce programming model, one of the key technologies in cloud computing, for processing massive amounts of data on a cluster. After Google published this core technology as a paper, an open-source implementation called Hadoop followed two years later. At present, both the Google MapReduce engine and Hadoop run on large clusters built from many commodity personal computers. Although commodity computers have clear advantages over supercomputers in cost and scalability, their stability and availability cannot meet the needs of current industrial production.

Based on Platform MapReduce, an enterprise data processing engine released by Platform Computing in July 2011, this thesis studies how to build a system failure recovery mechanism based solely on the original system architecture. First, to eliminate the single point of failure, the thesis proposes a master candidates list: a new master reads the recorded running state, such as a snapshot and an event log, and brings the system back into operation. Second, to meet the enterprise requirement that the system run without interruption, the thesis proposes a series of methods, such as a log in the shared file system that records job status and user data. If tasks fail for system reasons, the system can reconstruct them from the logs and rerun them. Third, the failure recovery mechanism must be transparent to users: when a failure occurs, the system recovers automatically, without human assistance. The solution builds on the system's underlying communication library, which detects failures through mechanisms such as heartbeats and notifies the upper layer, which then runs the failure recovery mechanism automatically. Finally, because the MapReduce engine processes massive amounts of data, the efficiency of failure recovery does not by itself meet enterprise requirements. The thesis introduces a technique called paging to address this problem and improve the efficiency of failure recovery.

Through the design, implementation and testing of these solutions, the problems above are solved, meeting the requirements for a failure recovery mechanism in an enterprise-level MapReduce system.
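
The abstract gives no implementation details, so the following is only a minimal Python sketch of how heartbeat-based failure detection, a master candidates list, and snapshot plus event-log recovery could fit together. All names (`MasterState`, `take_snapshot`, `recover`, `failed_hosts`, `elect_master`), the JSON event format, the sequence numbers, and the first-live-candidate policy are assumptions made for illustration; they are not taken from Platform MapReduce or from the thesis.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MasterState:
    """Job table kept by the master: job id -> status string (assumed layout)."""
    jobs: dict = field(default_factory=dict)
    last_seq: int = 0  # sequence number of the last event applied

    def apply(self, event: dict) -> None:
        # Assumption: every event carries a monotonically increasing "seq".
        self.jobs[event["job"]] = event["status"]
        self.last_seq = event["seq"]


def take_snapshot(state: MasterState) -> str:
    """Serialize the full state; a real system would write it to shared storage."""
    return json.dumps({"jobs": state.jobs, "last_seq": state.last_seq})


def recover(snapshot: str, event_log: list[str]) -> MasterState:
    """Rebuild state on a new master: load the snapshot, then replay newer events."""
    data = json.loads(snapshot)
    state = MasterState(jobs=dict(data["jobs"]), last_seq=data["last_seq"])
    for line in event_log:
        event = json.loads(line)
        if event["seq"] > state.last_seq:  # skip events the snapshot already covers
            state.apply(event)
    return state


def failed_hosts(last_heartbeat: dict[str, float], now: float, timeout: float) -> set[str]:
    """Hosts whose most recent heartbeat is older than `timeout` seconds."""
    return {host for host, seen in last_heartbeat.items() if now - seen > timeout}


def elect_master(candidates: list[str], alive: set[str]) -> str:
    """Pick the first live host in the candidates list (hypothetical policy)."""
    for host in candidates:
        if host in alive:
            return host
    raise RuntimeError("no live master candidate")
```

In this sketch, a candidate that takes over would call `recover` with the latest snapshot and the event log, so only events newer than the snapshot are replayed. The paging technique mentioned in the abstract is not modeled here because the abstract does not describe it.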

Keywords/Search Tags:

MapReduce, large-scale cluster, failure recovery, event-log, snapshot


Related items

1	Design And Implementation Of The Failure Recovery Mechanism In MapReduce
2	Social Networks, The Recovery Of The Performance Impact Of Large-scale Distributed Systems Research
3	Online New Event Detection For Large Scale Dataset
4	Research And Application Of Clustering Algorithms For Large Scale Data
5	Large-scale High-performance Computer Cluster Failure Rapid Diagnosis And Automatic Recovery System Developed
6	Research Of Task Recovery Strategy Based On Checkpoint In MapReduce
7	Research On Softedge Blending Technology Application In Large-scale Live Event
8	Research On Key Technologies Of Fault Tolerance Of Large Scale Distributed Simulation System
9	Application-Aware On-Line Failure Recovery For Extreme-Scale HPC Environment
10	The Optimization Of High Performance MapReduce FairScheduler And The Implementation On Simulator Of Huge Scale Cluster