
Design And Implementation Of System Failure Recovery Mechanism In Enterprise Level MapReduce System

Posted on: 2013-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: T D Lei
Full Text: PDF
GTID: 2268330392469547
Subject: Software engineering
Abstract/Summary:
In 2004, Google introduced the MapReduce programming model, one of the key technologies in cloud computing, for processing massive amounts of data on a cluster. Two years after Google published this technology as a paper, an open-source implementation called Hadoop appeared. At present, both Google's MapReduce engine and Hadoop run on large clusters composed of many general-purpose personal computers. Although such computers have clear advantages over supercomputers in cost and scalability, their stability and availability cannot meet the needs of current industrial production.

Based on Platform MapReduce, an enterprise data processing engine released by Platform Computing in July 2011, this thesis studies how to build a system failure recovery mechanism based solely on the original system architecture. First, to eliminate the single point of failure, it proposes a list of master candidates: when the master fails, a new master reads the recorded running state, such as a snapshot and an event log, and brings the system back into operation. Second, to meet the enterprise requirement that the system run without interruption, it introduces a series of methods, such as a log kept in the shared file system that records job status and user data. If tasks fail for system reasons, the system can reconstruct them from the logs and rerun them. Third, the failure recovery mechanism must be transparent to users: when a failure occurs, the system recovers automatically, without human assistance. This is achieved through the system's underlying communication library, which detects failures through mechanisms such as heartbeats and notifies the upper layer, which then runs the failure recovery mechanism automatically. Finally, because the MapReduce engine processes massive amounts of data, the efficiency of failure recovery did not meet enterprise requirements. This thesis introduces a technique called paging to address this problem and improve the efficiency of failure recovery.

Through the design, implementation, and testing of these solutions, the problems above are solved, meeting the requirements for a failure recovery mechanism in an enterprise-level MapReduce system.
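
The abstract does not give the data formats or interfaces used by the thesis, so the following is only a minimal sketch of the general idea of recovering master state from a snapshot plus an event log, followed by choosing a new master from a candidate list. Every name here (MasterState, take_snapshot, recover, elect_master), the JSON encoding, the event fields, and the "first live candidate wins" rule are assumptions made for illustration, not the design of Platform MapReduce.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MasterState:
    """Hypothetical job table held by the master: job id -> status."""
    jobs: dict = field(default_factory=dict)
    last_seq: int = 0  # sequence number of the last event applied

    def apply(self, event: dict) -> None:
        # Events at or below last_seq are already reflected in the state,
        # so they are skipped; this makes replaying the log idempotent.
        if event["seq"] <= self.last_seq:
            return
        if event["type"] == "submit":
            self.jobs[event["job"]] = "PENDING"
        elif event["type"] == "status":
            self.jobs[event["job"]] = event["status"]
        self.last_seq = event["seq"]


def take_snapshot(state: MasterState) -> str:
    """Serialize the full state (assumed to be stored on the shared file system)."""
    return json.dumps({"jobs": state.jobs, "last_seq": state.last_seq})


def recover(snapshot: str, event_log: list) -> MasterState:
    """Rebuild state on a new master: load the snapshot, then replay the log."""
    data = json.loads(snapshot)
    state = MasterState(jobs=dict(data["jobs"]), last_seq=data["last_seq"])
    for line in event_log:
        state.apply(json.loads(line))
    return state


def elect_master(candidates: list, alive: set) -> str:
    """Pick the first candidate in the ordered list that is still alive."""
    for host in candidates:
        if host in alive:
            return host
    raise RuntimeError("no live master candidate")
```

In this sketch the snapshot bounds how much of the log must be replayed, which is one common reason to combine the two; whether the thesis uses them this way is not stated in the abstract.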
Keywords/Search Tags:MapReduce, large-scale cluster, failure recovery, event-log, snapshot