
Research On Fault Tolerance Of High-performance Computing With NVRAM

Posted on: 2014-10-30    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Li    Full Text: PDF
GTID: 1268330422473982    Subject: Computer Science and Technology
Abstract/Summary:
Recently, high-performance computing has developed very rapidly, and it is forecast that high-performance computing will enter the exascale (10^18 Flops) era by the year 2020. However, as system scale increases, the reliability of high-performance computers decreases sharply, so high-performance computing systems must rely on fault tolerance techniques to complete their computations correctly. Moreover, the growing system scale not only lowers system reliability but also raises the cost of fault tolerance. Studies show that this high fault tolerance cost would drive the utilization of exascale systems toward zero. As a result, current fault tolerance techniques cannot satisfy the requirements of future high-performance computing, and new techniques must be studied to address this challenge.

Emerging Non-Volatile Random-Access Memory (NVRAM) technologies promise large, fine-grained, fast, and non-volatile memory devices for computer designers. NVRAM technologies are developing rapidly and are expected to become available after 2015. NVRAM may then replace DRAM as main memory, add a new memory level between DRAM and the disk, or replace the disk as a fast storage device. No matter how NVRAM is integrated into the memory hierarchy, it offers a new opportunity for fault tolerance research. In this dissertation, we focus on how to leverage NVRAM technologies to improve the performance of fault tolerance techniques and carry out the following work:

1. Algorithm-based fault tolerance
Algorithm-based fault tolerance (ABFT) is a very cost-effective way to incorporate fault tolerance into applications. ABFT approaches adapt algorithms and apply appropriate mathematical operations to both the original data and the recovery data, so that once a failure occurs the application dataset can be recovered with very low overhead. Currently, ABFT approaches are mainly used in matrix operations and are not suitable for general data structures. To fill this gap, we propose an NVRAM-based approach that extends ABFT to algorithms operating on link-based data structures. Our approach ensures data consistency by maintaining the atomicity of each iteration (a minimal sketch of this idea appears after part 2 below). We demonstrate the practicality of our approach by applying it to the Barnes-Hut and K-means algorithms. The experimental results show that our approach survives failures within a performance overhead of 10%.

2. Fault tolerance process model
In the traditional process model, the OS and processes are tightly coupled, and re-initializing the OS destroys process data: even if a process executes in NVRAM, it cannot be restored after the OS reboots. To address this challenge, we propose a fault-tolerant process abstraction based on NVRAM, called NV-process, which supports fault tolerance natively. First, NV-process decouples processes from the OS; processes are stand-alone instances running in a self-contained way in NVRAM. Second, NV-process provides a transactional execution model to make a process persistent efficiently. Third, NV-process provides an in-place restart technique to restore a process very efficiently. When the system powers off, whether intentionally or not, NV-process instances remain in NVRAM and can continue running where they left off after the OS reboots. The experimental results show that NV-process accomplishes fault tolerance with a low performance overhead.
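The abstract only states that the ABFT extension (part 1 above) keeps link-based data structures consistent by making each iteration atomic in NVRAM, without giving implementation details. The following C sketch shows one plausible shape of such an atomic iteration, using an undo log; the pmem_persist primitive, the log layout, and the function names are assumptions for illustration, not the dissertation's actual code.

/*
 * Minimal sketch: making one iteration's update to a link-based
 * structure atomic in NVRAM via an undo log. pmem_persist() stands in
 * for an assumed "flush cache lines + fence" primitive; a real system
 * would use clwb/clflushopt plus a fence or a persistent-memory library.
 */
#include <stddef.h>

typedef struct node {
    double       value;
    struct node *next;
} node_t;

/* Assumed primitive: force the given range out to NVRAM. Stubbed here. */
static void pmem_persist(const void *addr, size_t len) { (void)addr; (void)len; }

/* Undo-log record, assumed to reside in NVRAM. */
static struct {
    node_t *target;     /* node being modified               */
    node_t  old_copy;   /* its contents before the iteration */
    int     valid;      /* 1 while an update is in flight    */
} log_rec;

/* Update one node so that either the old or the new value survives a crash. */
void atomic_update(node_t *n, double new_value)
{
    log_rec.target   = n;
    log_rec.old_copy = *n;
    pmem_persist(&log_rec, sizeof(log_rec));              /* persist the log first */
    log_rec.valid = 1;
    pmem_persist(&log_rec.valid, sizeof(log_rec.valid));

    n->value = new_value;                                  /* in-place update */
    pmem_persist(&n->value, sizeof(n->value));

    log_rec.valid = 0;                                     /* iteration commits */
    pmem_persist(&log_rec.valid, sizeof(log_rec.valid));
}

/* After a failure: a still-valid log means the iteration did not commit,
 * so roll the node back to its pre-iteration state. */
void recover_after_failure(void)
{
    if (log_rec.valid) {
        *log_rec.target = log_rec.old_copy;
        pmem_persist(log_rec.target, sizeof(*log_rec.target));
        log_rec.valid = 0;
        pmem_persist(&log_rec.valid, sizeof(log_rec.valid));
    }
}

Under this scheme an iteration interrupted by a failure is rolled back from the undo log, so the algorithm always resumes from a consistent structure.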
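Likewise, the NV-process model (part 2 above) is described only at a high level. The C sketch below illustrates how a commit flag in a persistent process descriptor could support the transactional execution and in-place restart described here; the descriptor fields, NVP_MAGIC, and pmem_persist are hypothetical stand-ins, not the dissertation's design.

/*
 * Hypothetical sketch of the NV-process commit / in-place-restart idea;
 * the actual descriptor layout and restart path are not given in the
 * abstract, so everything below is an assumed simplification.
 */
#include <stddef.h>
#include <stdint.h>

#define NVP_MAGIC 0x4E565021u            /* marks a valid descriptor */

typedef struct {
    uint32_t  magic;                     /* descriptor is initialized            */
    uint32_t  committed;                 /* 1 iff the last step fully persisted  */
    uintptr_t heap_base;                 /* self-contained heap region in NVRAM  */
    size_t    heap_size;
    uint64_t  resume_point;              /* saved continuation (simplified)      */
} nv_proc_t;

/* Assumed primitive: flush + fence to NVRAM. Stubbed here. */
static void pmem_persist(const void *p, size_t n) { (void)p; (void)n; }

/* Transactional step: clear the flag, persist the new state, then set the
 * flag; a crash before the final flip simply replays this step on restart. */
void nv_commit_step(nv_proc_t *p, uint64_t next_resume_point)
{
    p->committed = 0;
    pmem_persist(&p->committed, sizeof(p->committed));
    /* ...update and persist heap objects in NVRAM here... */
    p->resume_point = next_resume_point;
    pmem_persist(&p->resume_point, sizeof(p->resume_point));
    p->committed = 1;
    pmem_persist(&p->committed, sizeof(p->committed));
}

/* After an OS reboot: find the descriptor in NVRAM and continue where the
 * process left off, instead of re-creating it from scratch. */
int nv_inplace_restart(const nv_proc_t *p)
{
    if (p->magic != NVP_MAGIC)
        return -1;                       /* no NV-process instance here     */
    /* Map heap_base..heap_base+heap_size back at the same address and hand
     * control to resume_point (details omitted).                           */
    return p->committed ? 0 : 1;         /* 1: replay the in-flight step    */
}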
3. Any-grained incremental checkpoint
The cost of incremental checkpointing mainly comes from detecting and saving dirty data. Because of limited disk bandwidth and the block-access property of disks, most state-of-the-art dirty data detection is coarse-grained and implemented at page granularity. Although this coarse-grained approach reduces the detection cost, it increases the saving cost: our experiments show that dirty pages contain a large amount of unmodified data within a checkpoint interval, so a page-granularity incremental checkpoint stores a lot of redundant data in the checkpoint file. To address this issue, we design and implement a new incremental checkpoint scheme named AG-ckpt (Any Granularity checkpoint) based on NVRAM. Moreover, we formulate the performance-granularity relationship of checkpoint systems as a mathematical model and derive its optimal solutions; the model is general and can be used to tune the granularity parameter of other checkpoint systems. The experimental results show that our approach gains a speedup of 1.2x-1.3x in checkpoint efficiency.
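The performance-granularity model itself is not reproduced in this abstract, so the following C sketch uses an assumed toy cost model: finer granularity raises the per-interval detection cost, while coarser granularity writes more unmodified bytes along with each dirty unit. Sweeping the granularity shows how such a model can expose an optimum; the constants and the amplification term are illustrative only and are not the dissertation's model.

/*
 * Toy cost model (assumed form, hypothetical constants): sweep the tracking
 * granularity and report the per-checkpoint cost as detection + save time.
 */
#include <stdio.h>

#define MEM_BYTES      (1ULL << 30)   /* tracked memory: 1 GiB                 */
#define DETECT_NS      50.0          /* assumed per-tracked-unit detection cost */
#define NVRAM_GBPS     5.0           /* assumed checkpoint write bandwidth      */
#define DIRTY_FRACTION 0.10          /* truly modified bytes per interval       */

/* With granularity g (bytes), detection touches MEM_BYTES/g units, while the
 * data actually written grows with g because each dirty unit also carries
 * unmodified bytes. */
static double checkpoint_cost(double g)
{
    double units    = (double)MEM_BYTES / g;
    double detect_s = units * DETECT_NS * 1e-9;
    double amplify  = 1.0 + g / 4096.0;                /* redundancy grows with g */
    double saved_b  = DIRTY_FRACTION * (double)MEM_BYTES * amplify;
    double save_s   = saved_b / (NVRAM_GBPS * 1e9);
    return detect_s + save_s;
}

int main(void)
{
    double best_g = 64.0, best_c = checkpoint_cost(64.0);
    for (double g = 64.0; g <= 65536.0; g *= 2.0) {    /* sweep granularities */
        double c = checkpoint_cost(g);
        printf("granularity %8.0f B  cost %.4f s\n", g, c);
        if (c < best_c) { best_c = c; best_g = g; }
    }
    printf("best granularity under this toy model: %.0f bytes\n", best_g);
    return 0;
}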
Keywords/Search Tags: High-performance computing, System reliability, Fault tolerance, NVRAM, Process model, Algorithm-based fault tolerance, Checkpoint technology