The Study And Implementation Of High-efficiency Fault-tolerant Processor

Posted on:2014-12-16

Degree:Doctor

Type:Dissertation

Country:China

Candidate:G H Liu

GTID:1228330422974191

Subject:Computer Science and Technology

Abstract/Summary:

As fabrication technology continues to advance, microprocessors are increasingly exposed to transient faults, and processor reliability has become a severe challenge. This thesis investigates the reliability of multi-core processors from the perspective of fault propagation behavior, and explores the design and implementation of highly efficient fault-tolerant multi-core processors. The contributions of this thesis are as follows:

1. Fault propagation behavior within a single thread is analyzed. Building on existing research that uses checkpoints to tolerate faults segment by segment, we apply data-flow analysis to study how data errors caused by hardware faults propagate as instructions execute, analyze fault propagation between segments within a single thread, and establish the corresponding error propagation equations and algorithm. We propose a fault-locating analysis method based on known faults, and obtain the minimum set of data that must be checked to prevent fault propagation across multiple segments. The theory of fault propagation within a single thread can guide the fault-detection and fault-tolerance design of processor cores.

2. Fault propagation behavior between multiple threads is analyzed. Specifically, we analyze the segment structure of shared-memory parallel programs; analyze fault propagation on different parallel segment structures, and find that the heterocyclic structure is the cause of backward pollution propagation of a generated fault; prove that heterocyclic and pure-cyclic structures can be converted to non-cyclic structures by modifying the segment division method, so that backward pollution propagation can be avoided; and point out the influence of weak memory consistency models on fault propagation between multiple threads.
The theory of fault propagation between multiple threads can guide the fault-detection and fault-tolerance design of multi-core processors.

3. The BRO-SOC framework is proposed to summarize the relationship between fault detection/isolation boundaries and the system memory hierarchy. It defines a "correctness domain": the functional units and program states within it are able to logically maintain correctness as computation proceeds. A new fault-tolerant processor architecture based on temporal redundancy, DoubleRun, is proposed under the BRO-SOC framework. DoubleRun splits a program into instruction chunks (fault-free transactions), executes each chunk twice, and then compares the signatures of the two temporally redundant executions to detect faults. The innovations of DoubleRun include: (1) using temporal redundancy to tolerate transient faults, which eliminates the problems of on-chip inter-core queues or customized communication channels; (2) setting the fault detection and isolation boundary at the proper level of the memory hierarchy in the SOC framework, which shortens the fault propagation distance and detection latency, reduces the size of the speculative context and the overhead of managing it, and avoids the performance degradation caused by modifying the mature design of the processor pipeline; (3) using store operations to detect faults, and employing a CRC algorithm to encode all store information into a fingerprint, so that fault detection efficiency is improved by comparing fingerprints; (4) using a pure hardware approach to implement the checkpoint mechanism in the correctness domain, which reduces the overhead of creating and maintaining checkpoints.

4. The DoubleRun processor architecture is extended to DoubleRun-MP, a parallel fault-tolerant processor architecture.
Using temporal redundancy, the computation and verification of a parallel application can be performed locally in each processor core without involving other cores, which localizes and distributes fault detection and the committing of new state, and increases the system's scalability. DoubleRun-MP introduces the PSB buffer to support multiple unverified fault-free transactions in progress simultaneously; this avoids the busy-waiting problem caused by dependences between fault-free transactions and increases processor utilization. To support the sharing of unverified data, DoubleRun-MP adopts the MOESI cache coherence protocol and modifies it to support the shared bus transactions produced by the original execution entity and the redundant execution entity. Based on the conclusions of Chapter 3, we propose a read-before-write program splitting mechanism that constructs fault-free transactions for parallel environments and avoids the domino effect caused by rollback when a fault is detected. We then use Lamport logical clocks to order the fault-free transactions; this logical order facilitates their commit and rollback. To solve the input coherence problem for the two execution entities of a fault-free transaction, we propose the concept of a memory access window and add an instruction age table for each transaction, which prevents write operations from breaking the memory access window. This ensures that both execution entities of a fault-free transaction see the same memory image, and hence that the execution semantics of the parallel application remain correct.
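
The chunk-level detection idea described for DoubleRun in contribution 3 (execute a chunk twice, fold its stores into a CRC fingerprint, commit only if the two fingerprints match) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design of the thesis: the names `fingerprint`, `run_chunk_twice`, `add_chunk` and `make_flaky_chunk` are hypothetical, a chunk is modeled as a Python function that returns its list of stores, values are assumed to be non-negative 64-bit integers, and `zlib.crc32` stands in for whatever CRC the thesis actually uses.

```python
import zlib
from typing import Callable, Dict, List, Tuple

Store = Tuple[int, int]          # (address, value) of one store operation
Memory = Dict[int, int]          # committed memory image
Chunk = Callable[[Memory], List[Store]]  # assumed model of an instruction chunk


def fingerprint(stores: List[Store]) -> int:
    """Fold every store of a chunk into one CRC-32 value (stand-in CRC)."""
    crc = 0
    for addr, value in stores:
        data = addr.to_bytes(8, "little") + value.to_bytes(8, "little")
        crc = zlib.crc32(data, crc)
    return crc


def run_chunk_twice(chunk: Chunk, memory: Memory) -> bool:
    """Run a chunk twice on the same committed image; commit only on a match."""
    first = chunk(dict(memory))    # original execution on a private copy
    second = chunk(dict(memory))   # temporally redundant execution
    if fingerprint(first) != fingerprint(second):
        return False               # fault detected: stores discarded, image unchanged
    for addr, value in first:      # fingerprints agree: commit the stores
        memory[addr] = value
    return True


def add_chunk(mem: Memory) -> List[Store]:
    """Hypothetical fault-free chunk: mem[0x10] = mem[0x0] + mem[0x8]."""
    return [(0x10, mem[0x0] + mem[0x8])]


def make_flaky_chunk() -> Chunk:
    """Hypothetical chunk whose first execution suffers an injected bit flip."""
    calls = {"n": 0}

    def chunk(mem: Memory) -> List[Store]:
        calls["n"] += 1
        value = mem[0x0] + mem[0x8]
        if calls["n"] == 1:
            value ^= 1 << 3        # simulated transient fault in the first run only
        return [(0x10, value)]

    return chunk
```

In this model a fault-free chunk commits its stores, while a chunk hit by the injected fault leaves the committed image untouched, so re-execution can start from the same state. Comparing one fingerprint per chunk instead of every store is what keeps the comparison cheap in the scheme the abstract describes.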

Keywords/Search Tags:

reliability, transient fault, fault-tolerant processor, shared memory, section execution, fault propagation

Related items

1	Research On Fault Tolerant Wireless Interface Design And Fault Tolerant Routing Algorithm In WiNoC
2	The Design And Implementation Of Fault-tolerant Based On Minicore System
3	Research On Fast Reconstruction Algorithms For Fault Tolerant Processor Arrays
4	Research On Fault Recovery Techniques For Soft Errors Of COTS DSP
5	Design And Implementation Of Fault Injectors For High-End Fault-Tolerant Computer
6	Research On Transient Fault Recovery And Safety Control Of Networked Control System
7	Research And Implementation Of A Fault-Tolerance Evaluation Approach On The Fault-Tolerant Prototype
8	Robust Fault Estimation And Active Fault Tolerant Control For Uncertain System
9	Compile-Based Intermediate Code Key Variable Fault-Tolerant Technology