Using lightweight checkpoint/recovery to improve the availability and designability of shared memory multiprocessors

Posted on: 2003-05-15
Degree: Ph.D
Type: Dissertation
University: The University of Wisconsin - Madison
Candidate: Sorin, Daniel Jeremy
Full Text: PDF
GTID: 1468390011979133
Subject: Computer Science
Abstract/Summary:
To address downward trends in availability and designability, we propose using a lightweight checkpoint/recovery scheme called SafetyNet. SafetyNet is a hardware-only scheme that allows a shared memory multiprocessor to recover its system-wide state (including processor registers, caches, and memories) to a previous checkpoint. Thus, in the case of an error due to a device fault or a design fault, SafetyNet allows the system to recover to a pre-error state and re-execute. SafetyNet has three distinguishing features that enable it to provide error-free performance that is statistically equivalent to that of an unprotected system. First, it coordinates the system-wide checkpoints in logical time and leverages “logically atomic” cache coherence transactions. Second, SafetyNet uses an optimized logging scheme to reduce the amount of checkpoint state. Third, it pipelines checkpoint validation (the process of determining that a checkpoint is error-free and can be made the new recovery point) and keeps it entirely in the background.

We demonstrate that SafetyNet can be used in conjunction with a variety of existing error detection schemes to improve system availability. We also use SafetyNet to innovate in the areas of availability and designability. To improve availability, we leverage SafetyNet's ability to tolerate long error detection latencies. SafetyNet can tolerate latencies that are long enough to enable much stronger error detection techniques than are currently feasible. These techniques can use inter-node communication and system-wide invariant checking. To improve designability, we use SafetyNet to enable speculatively correct designs, as well as to tolerate certain classes of unintentional design faults. For rare and complicated system events, we demonstrate that we can fall back on SafetyNet (and treat these events as errors) instead of devoting design time and verification effort towards handling them.

We evaluate SafetyNet with full-system simulation and commercial workloads. Our results show that SafetyNet has negligible impact on error-free performance, while avoiding data corruption and system crashes when errors occur. We show that SafetyNet can provide this error recovery with reasonable storage costs and with negligible additional cache bandwidth.
Keywords/Search Tags: SafetyNet, Availability and designability, Checkpoint, Improve, Error