
The Study And Implementation Of High-efficiency Fault-tolerant Processor

Posted on: 2014-12-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G H Liu
GTID: 1228330422974191
Subject: Computer Science and Technology
Abstract/Summary:
As fabrication technology continues to advance, microprocessors are increasingly exposed to transient faults, and processor reliability has become a serious challenge. This dissertation investigates the reliability of multi-core processors from the perspective of fault propagation behavior and explores design and implementation techniques for highly efficient fault-tolerant multi-core processors. The contributions of this dissertation are as follows:

1. Analysis of fault propagation behavior within a single thread. Building on existing research that uses checkpointing to tolerate faults segment by segment, we apply data-flow analysis to study how data errors caused by hardware faults propagate as instructions execute, analyze fault propagation between segments of a single thread, and establish the corresponding error propagation equations and algorithm. We propose a fault-location analysis method based on known faults and derive the minimum set of data that must be checked to prevent faults from propagating across segments. The theory of single-thread fault propagation can guide the fault-detection and fault-tolerance design of processor cores.

2. Analysis of fault propagation behavior between multiple threads. Specifically, we analyze the segment structure of shared-memory parallel programs; analyze fault propagation on different parallel segment structures and find that the heterocyclic structure is the cause of the backward propagation of pollution from a generated fault; prove that heterocyclic and pure-cyclic structures can be converted to non-cyclic structures by changing how segments are divided, so that this backward propagation can be avoided; and point out the influence of weak memory consistency models on fault propagation between multiple threads.
The theory of multi-thread fault propagation can guide the fault-detection and fault-tolerance design of multi-core processors.

3. The BRO-SOC framework is proposed to summarize the relationship between fault detection/isolation boundaries and the system memory hierarchy. It defines a "correctness domain": the functional units and program states inside it are able to remain logically correct as computation proceeds. Under the BRO-SOC framework, we propose DoubleRun, a new fault-tolerant processor architecture based on temporal redundancy. DoubleRun splits a program into instruction chunks (fault-free transactions), executes each chunk twice, and compares the signatures of the two temporally redundant executions to detect faults. The innovations of DoubleRun include:
(1) Using temporal redundancy to tolerate transient faults, which eliminates the problems associated with on-chip inter-core queues or customized communication channels.
(2) Placing the fault detection and isolation boundary at the appropriate level of the memory hierarchy in the SOC framework, which shortens the fault propagation distance and detection latency, shrinks the speculative context and the overhead of managing it, and avoids the performance degradation caused by modifying a mature processor pipeline design.
(3) Using store operations to detect faults and employing a CRC algorithm to encode all store information into a fingerprint, so that faults are detected efficiently by comparing fingerprints.
(4) Implementing the checkpoint mechanism in the correctness domain purely in hardware, which reduces the overhead of creating and maintaining checkpoints.

4. The DoubleRun processor architecture is extended to DoubleRun-MP, a parallel fault-tolerant processor architecture.
Using temporal redundancy, the computation and verification of a parallel application can be performed locally in each processor core without involving other cores, which localizes and distributes fault detection and the committing of new state and increases the system’s scalability. DoubleRun-MP introduces the PSB buffer to support multiple unverified fault-free transactions in progress simultaneously; this avoids the busy waiting caused by dependences between fault-free transactions and increases processor utilization. To support the sharing of unverified data, DoubleRun-MP adopts the MOESI cache coherence protocol and modifies it to support the shared bus transactions produced by the original execution entity and the redundant execution entity. Based on the conclusions of Chapter 3, we propose a read-before-write program splitting mechanism that constructs fault-free transactions for the parallel environment and avoids the domino effect caused by rollback when a fault is detected. We then use Lamport logical clocks to order the fault-free transactions; this logical order facilitates their commit and rollback. To solve the input coherence problem for the two execution entities of a fault-free transaction, we propose the concept of a memory access window and add an instruction age table for each transaction, which prevents write operations from breaking the memory access window and ensures that both execution entities see the same memory image, thereby preserving the correct execution semantics of the parallel application.
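
As a rough software illustration of the store-fingerprint check described for DoubleRun in contribution 3, the sketch below runs a chunk twice, folds every store into a CRC fingerprint, and commits the buffered stores only when the two fingerprints match. It is not the dissertation's hardware design: the (dst, src, delta) operation format, the names run_chunk and double_run, the dictionary memory model, the use of CRC-32 from zlib, and the injected bit flip are all assumptions made for illustration.

```python
import struct
import zlib


def run_chunk(chunk, memory, fault_at=None):
    """Execute one chunk speculatively (hypothetical model, not the real pipeline).

    Each operation is (dst, src, delta): store memory[src] + delta to dst.
    Stores are buffered rather than written to memory, and each store's
    address and value are folded into a running CRC-32 fingerprint.
    """
    buffered = {}      # speculative stores, not yet visible in memory
    fingerprint = 0
    for i, (dst, src, delta) in enumerate(chunk):
        # Reads see earlier stores of the same chunk first, then memory.
        value = buffered.get(src, memory.get(src, 0)) + delta
        if i == fault_at:
            value ^= 1 << 3    # injected single-bit transient fault (assumption)
        buffered[dst] = value
        record = struct.pack("<Iq", dst, value)
        fingerprint = zlib.crc32(record, fingerprint)
    return fingerprint, buffered


def double_run(chunk, memory, fault_at=None):
    """Run the chunk twice; commit only if both fingerprints agree."""
    fp_first, stores = run_chunk(chunk, memory, fault_at)  # original execution
    fp_second, _ = run_chunk(chunk, memory)                # redundant execution
    if fp_first != fp_second:
        return False           # mismatch: drop buffered stores (rollback)
    memory.update(stores)      # match: commit the buffered stores
    return True


if __name__ == "__main__":
    chunk = [(1, 0, 2), (2, 1, 3)]
    clean = {0: 5}
    print(double_run(chunk, clean), clean)
    faulty = {0: 5}
    print(double_run(chunk, faulty, fault_at=1), faulty)
```

Because stores stay in the buffer until the comparison succeeds, a detected mismatch leaves memory untouched, which mirrors the idea of keeping unverified state inside the correctness domain. In this sketch the rollback is simply discarding the buffer; a real design would also restore register state from a checkpoint.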
Keywords/Search Tags: reliability, transient fault, fault-tolerant processor, shared memory, section execution, fault propagation