Using lightweight checkpoint/recovery to improve the availability and designability of shared memory multiprocessors

Posted on:2003-05-15

Degree:Ph.D

Type:Dissertation

University:The University of Wisconsin - Madison

Candidate:Sorin, Daniel Jeremy

GTID:1468390011979133

Subject:Computer Science

Abstract/Summary:

To address downward trends in availability and designability, we propose a lightweight checkpoint/recovery scheme called SafetyNet. SafetyNet is a hardware-only scheme that allows a shared memory multiprocessor to recover its system-wide state (including processor registers, caches, and memories) to a previous checkpoint. Thus, when an error occurs because of a device fault or a design fault, SafetyNet allows the system to recover to a pre-error state and re-execute. SafetyNet has three distinguishing features that enable it to provide error-free performance that is statistically equivalent to that of an unprotected system. First, it coordinates the system-wide checkpoints in logical time and leverages “logically atomic” cache coherence transactions. Second, it uses an optimized logging scheme to reduce the amount of checkpoint state. Third, it pipelines checkpoint validation (the process of determining that a checkpoint is error-free and can be made the new recovery point) and keeps it entirely in the background.

We demonstrate that SafetyNet can be used in conjunction with a variety of existing error detection schemes to improve system availability. We also use SafetyNet to innovate in the areas of availability and designability. To improve availability, we leverage SafetyNet's ability to tolerate long error detection latencies. SafetyNet can tolerate latencies that are long enough to enable much stronger error detection techniques than are currently feasible. These techniques can use inter-node communication and system-wide invariant checking. To improve designability, we use SafetyNet to enable speculatively correct designs, as well as to tolerate certain classes of unintentional design faults. For rare and complicated system events, we demonstrate that we can fall back on SafetyNet (and treat these events as errors) instead of devoting design time and verification effort towards handling them.

We evaluate SafetyNet with full-system simulation and commercial workloads. Our results show that SafetyNet has negligible impact on error-free performance, while avoiding data corruption and system crashes when errors occur. We show that SafetyNet can provide this error recovery with reasonable storage costs and with negligible additional cache bandwidth.
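
The logging and validation ideas in the abstract can be illustrated with a small software analogy. The sketch below is not the SafetyNet hardware design; it is a toy model under stated assumptions: a single dictionary stands in for system state, each checkpoint interval logs the old value of an address only on its first overwrite, and validation simply discards logs older than the new recovery point. The class name `CheckpointedMemory` and all of its methods are hypothetical names chosen for illustration. The sketch does not model logical-time coordination across nodes or background, pipelined validation.

```python
# Toy undo-log checkpointing, for illustration only (not the SafetyNet hardware).
_MISSING = object()  # marks "address did not exist before this interval"


class CheckpointedMemory:
    def __init__(self):
        self.data = {}           # current contents: address -> value
        self.current = 0         # checkpoint interval now executing
        self.recovery_point = 0  # newest checkpoint validated as error-free
        self.logs = {0: {}}      # interval -> {address: value before first write}

    def write(self, addr, value):
        log = self.logs[self.current]
        if addr not in log:
            # Log the old value only on the first overwrite in this interval;
            # later writes to the same address need no further logging.
            log[addr] = self.data.get(addr, _MISSING)
        self.data[addr] = value

    def take_checkpoint(self):
        """Start a new interval; return the number of the new checkpoint."""
        self.current += 1
        self.logs[self.current] = {}
        return self.current

    def validate(self, checkpoint):
        """Mark `checkpoint` as error-free and make it the recovery point."""
        if not self.recovery_point <= checkpoint <= self.current:
            raise ValueError("checkpoint out of range")
        # Logs from intervals before the new recovery point are never needed.
        for interval in [i for i in self.logs if i < checkpoint]:
            del self.logs[interval]
        self.recovery_point = checkpoint

    def recover(self):
        """Undo every write made since the recovery point, newest first."""
        for interval in sorted(self.logs, reverse=True):
            for addr, old in self.logs[interval].items():
                if old is _MISSING:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.current = self.recovery_point
        self.logs = {self.current: {}}
```

In this model, the storage cost of a checkpoint is bounded by the number of distinct addresses written in an interval rather than by the size of the whole state, which is the intuition behind logging only changed state.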

Keywords/Search Tags:

SafetyNet, Availability and designability, Checkpoint, Improve, Error

Related items

1	Research And Implementation Of The Checkpoint Technology In High Availability System
2	Research Of Process Migration Mechanism Based On Checkpoint In Computational Grid
3	Optimization Strategies For Storage In Distributed Checkpoint System
4	The Research And Implementation Of Checkpoint Technology Based On WinNT
5	Checkpoint Optimization Methods For Parallel Microreboot
6	Research And Implementation Of The Novel Heartbeat Inspecting Technique
7	WAAS error, integrity and availability modeling for GPS based aircraft landing system
8	Research On The Practice Of Keep APP Users' Behavior From The Perspective Of Media Availability
9	Multi-hiberarchy Based Availability Analysis Method For Global Navigation Satellite System
10	A holistic redundancy- and incentive-based framework to improve content availability in peer-to-peer networks