Research Of Task Recovery Strategy Based On Checkpoint In MapReduce

Posted on: 2019-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: P Wang
Full Text: PDF
GTID: 2428330563956738
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, data have grown explosively in recent years. Cloud computing and big data processing technologies have become indispensable tools for exploiting these data, and Google's MapReduce computing model is one of the most effective. Failures are a common occurrence in clusters, and without reasonable fault-tolerance techniques MapReduce applications suffer from poor performance. Hadoop, the most popular open-source implementation of MapReduce, provides users with a convenient framework and basic fault-tolerance capability. However, its coarse-grained recovery strategy, full task re-execution, imposes a large overhead on fault recovery, so jobs can be heavily delayed when a failure occurs.

This thesis analyzes the Hadoop MapReduce fault-tolerance scheme and proposes a comprehensive alternative, TRCID, covering both fault monitoring and task re-execution. By deploying checkpoint technology, it aims to reduce the delay of failure detection and the workload of task recomputation, so that a task can be resumed as soon as possible after a failure occurs and the efficiency of the whole job is improved. On the one hand, a multi-level checkpoint scheme is proposed to deal with different types of failures: by pushing intermediate data promptly instead of using the original pulling scheme, re-execution becomes unnecessary in most failure scenarios and the fault-tolerance overhead is minimized. On the other hand, TRCID monitors nodes through multiple performance indicators carried by the Hadoop heartbeat; compared with the original timeout strategy, it can detect failures more promptly and schedule accordingly.

The experiments are performed on Hadoop. First, we estimate the overhead by measuring execution time and hardware indicators. Then we design three scenarios based on the workflow of TRCID and the failure types, and run experiments in each scenario while varying job scale and failure rate. The results show that TRCID outperforms the original Hadoop fault-tolerance scheme and reduces the impact of failures.
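To illustrate the core idea behind checkpoint-based task recovery, resuming from saved partial state instead of re-executing the whole task, the following is a minimal Python sketch. It is not TRCID's implementation: the record-based interval, the single JSON checkpoint file, and the function names are all illustrative assumptions.

```python
import json
import os
import tempfile

CHECKPOINT_INTERVAL = 100  # records between checkpoints (illustrative value)

def run_task_with_checkpoints(records, state_dir, process):
    """Process records sequentially, persisting (offset, partial result)
    every CHECKPOINT_INTERVAL records.

    If the task is restarted after a failure, it resumes from the last
    checkpoint rather than re-executing from record 0 -- the general idea
    behind checkpoint recovery, not TRCID's actual mechanism.
    """
    ckpt_path = os.path.join(state_dir, "task.ckpt")
    start, acc = 0, 0
    if os.path.exists(ckpt_path):  # recover from the last checkpoint
        with open(ckpt_path) as f:
            saved = json.load(f)
        start, acc = saved["offset"], saved["partial"]
    for i in range(start, len(records)):
        acc = process(acc, records[i])
        if (i + 1) % CHECKPOINT_INTERVAL == 0:
            # write-then-rename so a crash never leaves a torn checkpoint
            fd, tmp = tempfile.mkstemp(dir=state_dir)
            with os.fdopen(fd, "w") as f:
                json.dump({"offset": i + 1, "partial": acc}, f)
            os.replace(tmp, ckpt_path)
    return acc
```

The write-then-rename step matters: a checkpoint that can itself be corrupted by the failure it is meant to survive is worse than none. A rerun of the same task against the same `state_dir` skips every already-checkpointed record, which is the saving that full re-execution forgoes.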
Keywords/Search Tags: MapReduce, Hadoop, Fault Tolerance, Checkpoint, Failure Detection