
Research And Optimization Of Adaptive Checkpoint Technique In MapReduce

Posted on: 2016-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2308330476953479
Subject: Software engineering
Abstract/Summary:
Big data has gained popularity over the last few years and has become one of the most significant areas of computer science. Humanity is exploring a brand-new pattern of acquiring knowledge, and the ability to efficiently gather, store and analyze big data is key to a company's success. Data volumes are growing non-linearly, and traditional database technology can no longer satisfy the demand. The MapReduce framework, presented by Google in 2004, proved to be a milestone in big data analysis. The paradigm behind MapReduce is quite simple, which makes scheduling and fault tolerance very straightforward. However, Hadoop, the most popular open-source implementation of MapReduce, often suffers significant performance degradation in the presence of failures.

This thesis discusses the principles and execution flow of Hadoop MapReduce in depth, along with its flaws, and presents BeTL, which introduces only slight changes yet greatly improves the performance of Hadoop MapReduce under failures. Hadoop MapReduce implements fault tolerance at the task level, so a failed task is re-executed in its entirety. The MapReduce paradigm is flexible enough that a finer-grained strategy is possible. The general idea of BeTL is to reuse partial results as much as possible and thus reduce job execution time.

BeTL makes the most of the spill files generated by map tasks and creates checkpoints based on them. Reduce tasks no longer shuffle per-map-task output files; instead, they shuffle spill files directly. When a map task fails, as long as a spill file is available, the new task attempt can skip the corresponding input range (a sketch of this range-skipping idea appears after this abstract). Speculative execution can also benefit from the checkpoints, and BeTL modifies the LATE algorithm to make the best use of it (see the second sketch below). Further optimizations are also discussed, including a combiner cache and resilient checkpointing tactics; all of these minimize BeTL's overhead and improve its effectiveness.

Comprehensive experiments are designed and performed to characterize the behavior, overhead and performance of every aspect of BeTL. A thorough analysis of the experimental data and other performance statistics shows that BeTL outperforms Hadoop both with no failures and under intensive failures. BeTL requires only minor changes to the original Hadoop MapReduce code base, making a strong case that simple changes can provide significant benefits. Carefully collecting and analyzing program performance statistics, together with comprehensive bottleneck inspection and optimization, is key to the success of a high-performance system.
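The abstract does not include BeTL's implementation, so the following is a minimal, hypothetical Java sketch of the range-skipping idea described above: each spill file is assumed to record the input byte range it covers, and a re-executed map attempt recomputes only the ranges that no surviving spill covers. All class, field and method names here are illustrative assumptions, not BeTL's actual code.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of BeTL-style spill-file checkpointing; names are
    // illustrative assumptions, not taken from the BeTL code base.
    public class SpillCheckpointDemo {

        // Records that one spill file durably covers the input byte range
        // [startOffset, endOffset) of a map task's split.
        static class SpillCheckpoint {
            final String spillPath;
            final long startOffset;
            final long endOffset;

            SpillCheckpoint(String spillPath, long startOffset, long endOffset) {
                this.spillPath = spillPath;
                this.startOffset = startOffset;
                this.endOffset = endOffset;
            }
        }

        // On re-execution, skip any input range already covered by a surviving
        // spill file and return only the ranges that must be recomputed.
        // Assumes survivors are sorted by startOffset and non-overlapping.
        static List<long[]> uncoveredRanges(long splitStart, long splitEnd,
                                            List<SpillCheckpoint> survivors) {
            List<long[]> todo = new ArrayList<>();
            long cursor = splitStart;
            for (SpillCheckpoint cp : survivors) {
                if (cp.startOffset > cursor) {
                    todo.add(new long[]{cursor, cp.startOffset});
                }
                cursor = Math.max(cursor, cp.endOffset);
            }
            if (cursor < splitEnd) {
                todo.add(new long[]{cursor, splitEnd});
            }
            return todo;
        }

        public static void main(String[] args) {
            List<SpillCheckpoint> survivors = new ArrayList<>();
            survivors.add(new SpillCheckpoint("spill0.out", 0, 64_000_000L));
            survivors.add(new SpillCheckpoint("spill1.out", 64_000_000L, 128_000_000L));
            // Only the uncovered tail of the split is reprocessed.
            for (long[] r : uncoveredRanges(0, 200_000_000L, survivors)) {
                System.out.println("recompute [" + r[0] + ", " + r[1] + ")");
            }
        }
    }

The key property is that recovery cost shrinks in proportion to how much of the input had already been spilled before the failure, instead of being a fixed full re-execution.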
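The second sketch illustrates the LATE heuristic that, per the abstract, BeTL modifies: LATE estimates a task's remaining time from its observed progress rate and speculates only on the slowest stragglers. The threshold value and all names below are illustrative assumptions, not BeTL's scheduler code; a checkpoint-aware variant, as the abstract suggests, would also let the backup attempt resume from the last spill checkpoint rather than from the start of the split.

    // Hypothetical sketch of a LATE-style speculation heuristic; the
    // threshold and names are illustrative, not BeTL's actual values.
    public class LateHeuristicDemo {

        // Estimated seconds until a task finishes, given its progress in
        // [0, 1] and the seconds it has been running (LATE's rate estimate).
        static double estimatedTimeLeft(double progress, double runtimeSec) {
            if (progress <= 0) {
                return Double.MAX_VALUE;   // no progress yet: assume worst case
            }
            double rate = progress / runtimeSec;  // progress per second
            return (1.0 - progress) / rate;
        }

        // Speculate only on tasks far behind the average; an illustrative
        // 1.5x threshold stands in for LATE's configurable cutoff.
        static boolean shouldSpeculate(double taskTimeLeft, double avgTimeLeft) {
            final double SLOW_TASK_THRESHOLD = 1.5;
            return taskTimeLeft > SLOW_TASK_THRESHOLD * avgTimeLeft;
        }

        public static void main(String[] args) {
            double straggler = estimatedTimeLeft(0.2, 100); // ~400s remaining
            double typical   = estimatedTimeLeft(0.8, 100); //  ~25s remaining
            // Using the typical task's estimate as the stand-in average.
            System.out.println("speculate on straggler? "
                    + shouldSpeculate(straggler, typical));
        }
    }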
Keywords/Search Tags:Hadoop, MapReduce, Fault tolerance, Checkpoint, Adaptive, Task scheduler, Speculative execution