Two-dimensional memory system protection

Posted on: 2009-04-23
Degree: Ph.D.
Type: Thesis
University: Carnegie Mellon University
Candidate: Kim, Jangwoo
Full Text: PDF
GTID: 2448390005953346
Subject: Engineering
Abstract/Summary:
In modern computer systems, the memory system plays a key role in determining overall performance and power consumption. However, the memory system is also the most vulnerable component in the system, and it directly affects the system's manufacturing cost and run-time reliability. As fabrication process technologies scale into the deep nanometer regime, both the frequency and scale of manufacturing defects (mostly caused by variability errors) and run-time errors (mostly caused by soft errors and wearout) will increase. These errors will cause high manufacturing costs, information losses, and physical failures. However, conventional memory protection techniques such as error correcting codes (ECC) and memory redundancy cannot handle errors that occur at such increasing frequency, and cannot scale without incurring high VLSI overheads.

This thesis first proposes 2D error coding, a scalable multi-bit error protection technique applied 'within' embedded memory arrays. It combines in-line small-scale error correction with off-line large-scale error correction to detect and correct large-scale information losses (e.g., multi-bit upsets) at minimum VLSI overhead. This thesis evaluates the scheme in the cache hierarchies of two chip multiprocessor designs and shows that 2D error coding can correct clustered errors of up to 32x32 bits at run time with significantly smaller performance, area, and power overheads than conventional techniques.

Next, this thesis investigates how this increased resilience can be traded off for higher-density bitcells, higher cell performance, greater cell stability, and lower-power design by correcting variability-induced manufacture-time hard errors in embedded memory arrays, while maintaining ∼100% yield. By conducting a series of Monte Carlo simulations of scaled cell models with device variability, this thesis first identifies the strong potential of multi-bit ECC for variability tolerance. It then proposes 2D erasure coding, a low-overhead multi-bit ECC that uses an erasure coding algorithm to correct variability-induced manufacture-time hard errors at the speed of conventional single-bit ECC. When combined with a small amount of row redundancy, the proposed scheme significantly improves memory access latency, power, and stability, while maintaining ∼100% yield and run-time reliability.

This thesis proposes RunFlat memory, a highly reliable, available, and serviceable (RAS) distributed shared-memory (DSM) system that survives large-scale run-time hard errors such as node failures. RunFlat memory applies 2D protection 'across' off-chip memory arrays by combining conventional block-level protection (e.g., ECC, 2D coding) with node-level memory RAID protection. Combined with a hardware-based on-line memory reconfiguration mechanism, RunFlat memory can detect and correct entire node failures, enable continued operation, and allow on-line repair service, while preserving the system's original performance and protection. Full-system simulations of a 16-node DSM server show that RunFlat memory incurs negligible performance overhead in error-free mode and significantly reduced performance overhead when operating with a failed node.

This thesis proposes two-dimensional (2D) memory protection techniques for building highly reliable, available, and serviceable memory systems while maintaining low manufacturing costs and high yields. The key innovation of 2D memory protection is to take the reconstruction of large-scale information loss off the critical path of normal operation, separating it from the low-overhead small-scale error detection and correction mechanisms. 2D memory protection can be applied at various levels of the memory system, from on-chip memory arrays to off-chip memory modules and nodes. This thesis proposes and evaluates three distinct applications of 2D memory protection: 2D error coding, 2D erasure coding, and RunFlat memory, which combat multi-bit errors, variability errors, and node failures, respectively.
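
As an illustration of the two-dimensional idea, the following minimal Python sketch pairs a cheap per-row check that runs on every access with a vertical parity row that is consulted only when a row fails its check. It is not the thesis's actual design: the abstract does not give the code configurations, so the word width, the 4-way interleaved parity, the single vertical parity row, and all function names here are assumptions chosen for brevity.

```python
from functools import reduce
from operator import xor

WORD_BITS = 16   # bits per memory row (assumed for illustration)
INTERLEAVE = 4   # number of interleaved horizontal parity lanes (assumed)


def horizontal_parity(word: int) -> int:
    """Interleaved parity: lane j holds the parity of bits j, j+4, j+8, ..."""
    parity = 0
    for i in range(WORD_BITS):
        parity ^= ((word >> i) & 1) << (i % INTERLEAVE)
    return parity


def encode(rows):
    """Return (per-row horizontal parity list, single vertical parity row)."""
    h = [horizontal_parity(r) for r in rows]
    v = reduce(xor, rows, 0)
    return h, v


def find_bad_rows(rows, h):
    """In-line check: indices of rows whose stored parity no longer matches."""
    return [i for i, r in enumerate(rows) if horizontal_parity(r) != h[i]]


def recover(rows, h, v):
    """Off-line repair: rebuild one faulty row from the vertical parity row."""
    bad = find_bad_rows(rows, h)
    if not bad:
        return list(rows)
    if len(bad) > 1:
        raise ValueError("one vertical parity row can repair only one faulty row")
    others = reduce(xor, (r for i, r in enumerate(rows) if i != bad[0]), 0)
    fixed = list(rows)
    fixed[bad[0]] = v ^ others  # XOR of parity and all good rows restores the bad row
    return fixed


rows = [0xBEEF, 0x1234, 0xCAFE, 0x0F0F]
h, v = encode(rows)
corrupted = list(rows)
corrupted[2] ^= 0b111 << 5          # 3-bit burst in row 2
repaired = recover(corrupted, h, v)
```

In this sketch the horizontal check touches only the accessed row, whereas `recover` reads every row in the array. That asymmetry is the reason the abstract gives for keeping large-scale reconstruction off the critical path of normal operation.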
Keywords/Search Tags: Memory, 2D error coding, Protection, Errors, Node failures, Erasure coding, Performance, ECC