Research Of Task Recovery Strategy Based On Checkpoint In MapReduce

Posted on: 2019-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: P Wang
Full Text: PDF
GTID: 2428330563956738
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, data have grown explosively in recent years. Cloud computing and big data processing technologies have become indispensable tools for exploiting these data, and Google's MapReduce computing model is one of the most effective. Failures are a common occurrence in clusters, and without reasonable fault-tolerance techniques MapReduce applications suffer from poor performance. Hadoop, the most popular open-source implementation of MapReduce, provides users with a convenient framework and basic fault-tolerance capability. However, its coarse-grained recovery strategy, full task re-execution, imposes a large overhead on fault recovery, so jobs can be heavily delayed when a failure occurs.

This thesis analyzes the Hadoop MapReduce fault-tolerance scheme and proposes a comprehensive alternative, TRCID, covering both fault monitoring and task re-execution. By deploying checkpoint technology, it aims to reduce the delay of failure detection and the workload of task recomputation, so that a task can be resumed as soon as possible after a failure occurs and the efficiency of the whole job is improved. On the one hand, a multi-level checkpoint scheme is proposed to deal with different types of failures: by pushing intermediate data promptly instead of using the original pulling scheme, re-execution becomes unnecessary in most failure scenarios and the fault-tolerance overhead is minimized. On the other hand, TRCID monitors nodes through multiple performance indicators carried by the Hadoop heartbeat; compared with the original timeout strategy, it can detect failures more promptly and schedule accordingly.

The experiments are performed on Hadoop. First, we estimate the overhead by measuring execution time and hardware indicators. Then we design three scenarios based on the workflow of TRCID and the failure types, and run experiments in each scenario while varying job scale and failure rate. The results show that TRCID outperforms the original Hadoop fault-tolerance scheme and reduces the impact of failures.
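To illustrate the core idea behind checkpoint-based task recovery, resuming from saved partial state instead of re-executing the whole task, the following is a minimal Python sketch. It is not TRCID's implementation: the record-based interval, the single JSON checkpoint file, and the function names are all illustrative assumptions.

```python
import json
import os
import tempfile

CHECKPOINT_INTERVAL = 100  # records between checkpoints (illustrative value)

def run_task_with_checkpoints(records, state_dir, process):
    """Process records sequentially, persisting (offset, partial result)
    every CHECKPOINT_INTERVAL records.

    If the task is restarted after a failure, it resumes from the last
    checkpoint rather than re-executing from record 0 -- the general idea
    behind checkpoint recovery, not TRCID's actual mechanism.
    """
    ckpt_path = os.path.join(state_dir, "task.ckpt")
    start, acc = 0, 0
    if os.path.exists(ckpt_path):  # recover from the last checkpoint
        with open(ckpt_path) as f:
            saved = json.load(f)
        start, acc = saved["offset"], saved["partial"]
    for i in range(start, len(records)):
        acc = process(acc, records[i])
        if (i + 1) % CHECKPOINT_INTERVAL == 0:
            # write-then-rename so a crash never leaves a torn checkpoint
            fd, tmp = tempfile.mkstemp(dir=state_dir)
            with os.fdopen(fd, "w") as f:
                json.dump({"offset": i + 1, "partial": acc}, f)
            os.replace(tmp, ckpt_path)
    return acc
```

The write-then-rename step matters: a checkpoint that can itself be corrupted by the failure it is meant to survive is worse than none. A rerun of the same task against the same `state_dir` skips every already-checkpointed record, which is the saving that full re-execution forgoes.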
Keywords/Search Tags: MapReduce, Hadoop, Fault Tolerance, Checkpoint, Failure Detection