Research On Configurable Fault Tolerance Techniques For Transient Faults

Posted on: 2014-06-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J L Li
GTID: 1108330479979661
Subject: Computer Science and Technology
Abstract/Summary:
Processor design trends toward smaller transistors, lower core voltage, and higher frequency have made transient faults a critical reliability concern across the entire computing market. Because users in different fields have different requirements for reliability, hardware cost, performance, and power consumption, improving system reliability under varying reliability and overhead constraints is a challenge for processor designers. To meet this challenge, this dissertation focuses on configurable, low-cost fault tolerance techniques. In addition, to evaluate the impact of transient faults and the reliability of the proposed techniques, it also investigates fault injection techniques. The main contributions are as follows:

1. Faults in the computation units of a processor may induce data-flow errors (DFEs) or control-flow errors (CFEs) during program execution. Existing techniques usually rely on redundant computation to detect DFEs, and reducing the considerable overheads (in performance, hardware, etc.) that this redundancy introduces remains a difficult problem. Combining the advantages of hardware- and software-based fault tolerance, this dissertation presents a configurable DFE detection technique called Epipe. With minimal modifications to a modern superscalar processor, Epipe first provides a hardware platform that can selectively protect instructions through instruction replication. Because a modern pipeline has abundant compute resources, the extra hardware cost of the Epipe platform is low. To reduce the performance overhead of redundant computation, Epipe analyzes the criticality of every instruction, i.e., the probability that a transient fault in that instruction causes the program to produce incorrect results. During program execution, Epipe selects a subset of the most critical instructions to protect according to the reliability and performance requirements. The novelty of Epipe is that it protects only those instructions that tend to cause incorrect program outputs. Faults that trigger system exceptions can be detected easily by the exception detection mechanisms already present in the system, and the remaining masked faults can be ignored because they never propagate to user-visible corruption. By handling different kinds of faults with different strategies and using a low-cost hardware protection technique, Epipe significantly reduces the number of instructions that need replication and detects DFEs efficiently.

2. An effective way to detect CFEs is software-implemented signature monitoring. Existing signature-monitoring techniques have drawbacks in performance overhead, memory overhead, and reliability, and they also lack configurability, so they cannot accommodate the differing requirements of different users. In addition, the extra instructions introduced by software solutions may themselves be corrupted by faults, yet existing CFE detection techniques provide no self-protection. This dissertation therefore proposes a configurable control-flow checking algorithm called CFCES. By assigning a formatted signature to each program block and instrumenting the blocks with control-flow checking instructions, CFCES detects more faults than existing control-flow checking algorithms at moderate overhead. CFCES also embeds an invariant (an equality) in the control-flow checking mechanism. By checking this invariant, CFCES detects faults that occur in the control-flow checking instructions themselves at extremely low cost.
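To make the idea of signature monitoring with a self-checking invariant concrete, here is a minimal Python sketch. It is not the CFCES algorithm, whose signature format and instrumentation are not given in this abstract. It assumes a generic XOR-based scheme: each block has a unique signature, a runtime register `g` is updated on every transfer, and a complement copy `g_bar` is kept so that `g ^ g_bar == MASK` always holds. The names `SignatureMonitor`, `MASK`, `transfer`, and `check` are hypothetical.

```python
MASK = 0xFFFF  # assumed signature width for this sketch


class SignatureMonitor:
    """Toy runtime model of XOR-based control-flow signature checking."""

    def __init__(self, signatures, entry):
        self.signatures = signatures      # block name -> unique signature
        self.current = entry              # block that control is really in
        self.g = signatures[entry]        # runtime signature register
        self.g_bar = self.g ^ MASK        # redundant complement copy

    def transfer(self, intended, actual):
        """Leave the current block for `intended`; control lands in `actual`."""
        # Difference between source and intended target signatures; in an
        # instrumented program this would be a compile-time constant.
        delta = self.signatures[self.current] ^ self.signatures[intended]
        # Updating both copies with the same delta preserves g ^ g_bar == MASK.
        self.g ^= delta
        self.g_bar ^= delta
        self.current = actual
        return self.check()

    def check(self):
        # Invariant (an equality): broken only if the checking state itself
        # was corrupted.
        if (self.g ^ self.g_bar) != MASK:
            return "checker-fault"
        # Signature comparison: fails if control reached an unintended block.
        if self.g != self.signatures[self.current]:
            return "control-flow-error"
        return "ok"


if __name__ == "__main__":
    sigs = {"A": 0x0001, "B": 0x0002, "C": 0x0004}
    monitor = SignatureMonitor(sigs, "A")
    print(monitor.transfer("B", "B"))   # legal transfer
    print(monitor.transfer("C", "A"))   # control lands in the wrong block
```

In this sketch the invariant costs one extra XOR per update and one comparison per check, which illustrates why an equality-based self-check can be cheap. How CFCES actually encodes its invariant is not stated here.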
Furthermore, CFCES provides a configurable optimization approach that analyzes the criticality of each function and tunes the granularity of the program blocks, so that each user's specific overhead and reliability constraints can be satisfied. This optimization approach improves the fault tolerance efficiency of CFCES and can also be applied to other signature-monitoring algorithms.

3. Transient faults occur not only in the computation units of a processor but also in its memory units. Because on-chip memory structures occupy most of a processor's area and are accessed frequently, traditional memory protection techniques such as ECC are unsuitable for them: the hardware area, performance, and power overheads are prohibitive. This dissertation focuses on protecting one particular kind of on-chip memory structure, SPM, and proposes a low-cost fault tolerance technique named PPS. The key insight behind PPS is that although protecting the whole SPM with ECC is expensive, protecting part of the SPM and making good use of the protected units is still worthwhile. PPS first designs an SPM-based memory architecture that protects only part of the SPM units (the proportion of protected units can be varied according to the reliability, performance, and power requirements of different applications). It then performs vulnerability analysis on all variables to be allocated and divides the SPM into pseudo-registers. Finally, PPS allocates the most vulnerable variables to the protected pseudo-registers using priority-based graph coloring. In this way, PPS provides effective memory protection with limited overhead.

4. Fault injection is an effective and widely used approach to reliability evaluation. Its drawback is that simulation speed and accuracy are difficult to trade off. Using program analysis, this dissertation develops an efficient fault injection framework named Smart Injector. Smart Injector first removes equivalence-class faults and known-outcome faults from the initial fault space. Equivalence-class faults are faults that occur in similar control-flow or data-flow contexts. Because such faults tend to affect a program similarly and produce the same outcomes, Smart Injector selects only one representative from each equivalence class for detailed fault simulation and prunes the faults it represents. Known-outcome faults are faults whose outcomes can be determined by program analysis alone. Smart Injector also exploits a fault outcome prediction technique, which can determine the outcome of a simulation before it runs to completion by predicting the fault outcome based on location. This shortens each individual simulation. With the proposed fault pruning and fault outcome prediction techniques, Smart Injector significantly reduces the computational resources required while maintaining highly accurate analysis results.
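The fault pruning step described above can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not Smart Injector's implementation: the `Fault` record, its `context` key (standing in for whatever control-flow or data-flow similarity the program analysis computes), the `known_outcome` field, and the `simulate` callback are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fault:
    fault_id: int
    context: str                          # hypothetical equivalence-class key
    known_outcome: Optional[str] = None   # set when analysis alone decides it


def prune(faults):
    """Split the fault space into known outcomes and weighted representatives."""
    known = {}
    classes = defaultdict(list)
    for fault in faults:
        if fault.known_outcome is not None:
            known[fault.fault_id] = fault.known_outcome   # no simulation needed
        else:
            classes[fault.context].append(fault)
    # One representative per class, weighted by the number of faults it stands for.
    representatives = {members[0]: len(members) for members in classes.values()}
    return known, representatives


def estimate(faults, simulate):
    """Estimate outcome counts for the full fault space from the pruned one."""
    known, representatives = prune(faults)
    counts = defaultdict(int)
    for outcome in known.values():
        counts[outcome] += 1
    for rep, weight in representatives.items():
        counts[simulate(rep)] += weight   # extrapolate to the whole class
    return dict(counts), len(representatives)


if __name__ == "__main__":
    space = [
        Fault(0, "loop1"), Fault(1, "loop1"), Fault(2, "loop1"),
        Fault(3, "br7"),
        Fault(4, "dead", known_outcome="masked"),
    ]
    # Stand-in for a detailed fault simulation.
    fake_simulate = lambda f: "sdc" if f.context == "loop1" else "crash"
    print(estimate(space, fake_simulate))
```

Here five faults require only two detailed simulations. The accuracy of such an estimate depends entirely on how well the equivalence classes capture faults with the same outcome, which is the part the dissertation's program analysis is responsible for.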
Keywords/Search Tags: Transient Faults, Program Analysis, Data-flow Checking, Control-flow Checking