Reliable ultra-low-voltage cache design for many-core systems

Posted on: 2017-01-01
Degree: Ph.D.
Type: Dissertation
University: University of Rochester
Candidate: Zhang, Meilin
Full Text: PDF
GTID: 1468390011997664
Subject: Electrical engineering
Abstract/Summary:
We classify cache errors as hard or soft errors. Hard errors may be caused by manufacturing defects, threshold or supply voltage variations, or device aging, whereas soft errors are introduced by external particle strikes or other random noise. Traditionally, most soft errors have manifested as single-event upsets. However, as technology scales into the nanometer era, the probability of multi-bit upsets increases significantly because a single particle strike can upset several cache cells. To address both single-bit and multi-bit upsets, we propose two-layer error control codes that efficiently combine the error detection capability of a rectangular code with the error correction capability of a Hamming product code, significantly improving system reliability while maintaining low area, power, and latency overhead.

To reduce the supply voltage below the normally acceptable VDDMIN while maintaining appropriate yield and reliability, we combine existing double-error correcting triple-error detecting (DECTED) codes with cache line disabling to handle both hard and soft errors efficiently. The proposed method applies a DECTED code to each cache line: one bit of correction capability covers a hard error, and the other covers a soft error. A cache line that contains multiple faulty cells is disabled. To further improve energy efficiency, we propose an adaptive fault-tolerant cache architecture that provides appropriate error control capability for each cache line based on the number of faulty cells detected. We use single-error correcting double-error detecting (SECDED) codes for each cache line to address soft errors, and extra parity bits are used when there are hard errors.
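
The DECTED-plus-line-disabling rule described above can be illustrated with a minimal sketch. It assumes only what the abstract states: a DECTED code corrects up to two bit errors per line, and a line with multiple faulty cells is disabled. The names `LineMode`, `LineStatus`, `classify_line`, and `usable_fraction` are hypothetical and do not come from the dissertation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class LineMode(Enum):
    ENABLED = "enabled"    # line stays in use, protected by DECTED
    DISABLED = "disabled"  # multiple faulty cells: line is taken out of use


@dataclass(frozen=True)
class LineStatus:
    mode: LineMode
    soft_error_budget: int  # bit corrections still available for soft errors


def classify_line(hard_faults: int) -> LineStatus:
    """Decide how one cache line is used, given its number of faulty cells.

    DECTED corrects at most two bit errors. One correction may be consumed
    by a hard fault, so at least one always remains for a soft error.
    Lines with two or more faulty cells are disabled (illustrative rule
    taken from the abstract's "multiple faulty cells").
    """
    if hard_faults < 0:
        raise ValueError("hard_faults must be non-negative")
    if hard_faults >= 2:
        return LineStatus(LineMode.DISABLED, 0)
    return LineStatus(LineMode.ENABLED, 2 - hard_faults)


def usable_fraction(fault_counts: Sequence[int]) -> float:
    """Fraction of cache lines that remain enabled under the rule above."""
    if not fault_counts:
        raise ValueError("fault_counts must not be empty")
    enabled = sum(
        classify_line(n).mode is LineMode.ENABLED for n in fault_counts
    )
    return enabled / len(fault_counts)
```

For example, `usable_fraction([0, 1, 2, 3])` reports that half of the four lines stay enabled, which is the capacity cost that line disabling trades for operation at a lower supply voltage.
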
Our experimental results show that the proposed method can further reduce the supply voltage and increase cache reliability.

We also propose a two-layer error control code that efficiently combines the error detection capability of rectangular codes with the error correction capability of Hamming product codes, in order to increase cache error resilience for many-core systems while maintaining low power, area, and latency overhead. Exploiting the low latency and overhead of rectangular codes and the high error control capability of Hamming product codes, the two-layer code applies a simple rectangular code to each cache line to detect errors and loads the extra Hamming product code check bits only when an error is detected, thus enabling reliable large-scale cache operation. Analysis and experiments are conducted to evaluate the fault tolerance of various existing solutions and of the proposed approach. The results show that, compared with complex four-way 4EC5ED, the proposed approach increases Mean-Error-To-Failure (METF) and Mean-Time-To-Failure (MTTF) by up to 2.8x, reduces storage overhead by over 57%, and increases instructions per cycle (IPC) by up to 7%; compared with simple eight-way SECDED, it increases METF and MTTF by up to 133x, reduces storage overhead by over 11%, and achieves a similar IPC. The cost of the proposed approach is no more than 4% external memory access overhead. To improve system reliability under a cache coherence protocol, two approaches are proposed: a pre-write-back policy and uneven error protection. The pre-write-back policy can reduce the number of cache lines in "irrecoverable" cache states, and uneven error protection provides an appropriate error control mechanism for each cache line based on its cache state.
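
The idea of state-dependent protection can be sketched as follows. The abstract does not say which coherence protocol or which codes are used per state, so everything specific here is an assumption made for illustration: MESI states, the view that only a Modified line is "irrecoverable" (it holds the sole up-to-date copy), and the three protection levels. The names `State`, `Protection`, `protection_for`, `pre_write_back`, and `irrecoverable_count` are hypothetical.

```python
from enum import Enum
from typing import List, Sequence


class State(Enum):
    # Assumed MESI protocol; the abstract does not name the protocol.
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Protection(Enum):
    NONE = 0         # nothing valid to protect
    DETECT_ONLY = 1  # a clean copy exists elsewhere, so refetch on error
    CORRECT = 2      # sole up-to-date copy, so errors must be corrected


def protection_for(state: State) -> Protection:
    """Assumed mapping from cache state to protection strength."""
    if state is State.MODIFIED:
        return Protection.CORRECT
    if state is State.INVALID:
        return Protection.NONE
    return Protection.DETECT_ONLY


def irrecoverable_count(states: Sequence[State]) -> int:
    """Number of lines whose data cannot be refetched after an error."""
    return sum(s is State.MODIFIED for s in states)


def pre_write_back(states: Sequence[State]) -> List[State]:
    """Idealized pre-write-back: every dirty line is written back early.

    After the write-back the line is clean, modeled here as Exclusive.
    A real policy would choose when and which lines to write back.
    """
    return [State.EXCLUSIVE if s is State.MODIFIED else s for s in states]
```

Under these assumptions, applying `pre_write_back` to a set of lines drives `irrecoverable_count` to zero, so every remaining valid line needs only detection plus a refetch.
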
Our analysis and experimental results show that the proposed uneven error-protection approach with pre-write-back policy can improve system reliability significantly. (Abstract shortened by ProQuest.)
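
The detection layer of the two-layer code described in the abstract relies on a rectangular code. A minimal sketch of such a code is given here, assuming the common construction in which the data bits are laid out as a rows-by-columns grid with one parity bit per row and per column. The grid dimensions and the function names `rectangular_checks` and `has_error` are assumptions for illustration; the Hamming product code correction layer is not modeled.

```python
from typing import List, Tuple

Bits = List[int]
Checks = Tuple[Bits, Bits]  # (row parities, column parities)


def rectangular_checks(data: Bits, rows: int, cols: int) -> Checks:
    """Compute even parity over each row and column of a row-major grid."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    row_parity = [sum(data[r * cols:(r + 1) * cols]) % 2 for r in range(rows)]
    col_parity = [sum(data[c::cols]) % 2 for c in range(cols)]
    return row_parity, col_parity


def has_error(data: Bits, stored: Checks, rows: int, cols: int) -> bool:
    """Detection layer: True if recomputed parities differ from stored ones.

    In the two-layer scheme, a True result would be the trigger for loading
    the stronger check bits; that correction step is omitted here.
    """
    return rectangular_checks(data, rows, cols) != stored
```

A short usage example: with `data = [1, 0, 1, 1, 0, 0, 1, 0]` stored as a 2 x 4 grid, flipping two adjacent bits in the same row leaves the row parity unchanged but alters two column parities, so `has_error` still reports the upset.
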
Keywords/Search Tags:Cache, Error, Results show that the proposed, Improve system reliability, Voltage, Approach, Hamming product, Low