Software Implemented Hardware Fault Tolerance

Posted on: 2007-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Gao
Full Text: PDF
GTID: 1118360242499216
Subject: Computer Science and Technology
Abstract/Summary:
Onboard computers are essential to information processing in space. In the space environment, transient hardware faults have a severe impact on onboard computers. Radiation-hardened components can improve system reliability, but their performance lags several generations behind that of COTS components. They are also very expensive because of their limited availability, and they often consume more power, take up more space, and weigh more. They are therefore not suitable for building high-performance space computers. Compared with radiation-hardened components, COTS components offer much higher performance, lower price, and lower power dissipation. Software-implemented hardware fault tolerance on COTS components can provide space computers with high reliability, high performance, low cost, and low power dissipation.

However, problems remain. They include how hardware faults propagate within software, how the fault tolerance capability of software can be measured, and what effect that capability has on system reliability. Using software to tolerate hardware faults also incurs considerable overhead, and how to minimize this overhead is still an open problem.

In this dissertation, we first set up a computational data flow model, on which we build an error flow model. By classifying errors into two kinds and introducing six rules of error propagation and two rules of error independence, we can obtain the error probability of any data item at any time. Following the concept of fault tolerance, we define the fault tolerance capability of a program. We analyze the consequences that a program's fault tolerance has for the fault tolerance and performance of a system. Taking fault tolerance capability as the target, we propose improving a program's fault tolerance capability at compile time through equivalent transformations based on error flow analysis.
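
The abstract does not spell out the six propagation rules or the two independence rules, so the following is only a minimal sketch of what computing error probabilities over a data flow graph might look like. It assumes a single simplified rule: a data item is correct only if no fault hits it directly and no erroneous input propagates into it, with all of these events treated as independent. The names `propagate_errors`, `fault_prob`, and `prop_prob` are hypothetical and do not come from the dissertation.

```python
from math import prod

def propagate_errors(graph, fault_prob, prop_prob):
    """Estimate the error probability of each node in a data flow graph.

    graph:      dict mapping node -> list of input nodes, listed in
                topological order (inputs appear before their users).
    fault_prob: dict mapping node -> probability that a fault corrupts
                the node directly (assumed 0.0 if missing).
    prop_prob:  dict mapping (src, dst) -> probability that an error in
                src propagates through the computation into dst.

    Assumption (not from the source): all events are independent.
    """
    err = {}
    for node, inputs in graph.items():
        # Probability that the node stays correct: no direct fault,
        # and no input error reaches it.
        p_ok = (1.0 - fault_prob.get(node, 0.0)) * prod(
            1.0 - prop_prob[(src, node)] * err[src] for src in inputs
        )
        err[node] = 1.0 - p_ok
    return err

# Hypothetical example: c = a + b always propagates input errors,
# while d = c & 0 masks them completely.
graph = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
fault_prob = {"a": 0.01, "b": 0.01}
prop_prob = {("a", "c"): 1.0, ("b", "c"): 1.0, ("c", "d"): 0.0}
err = propagate_errors(graph, fault_prob, prop_prob)
```

In this toy graph, `c` is erroneous unless both inputs are correct (1 - 0.99 * 0.99), while `d` has error probability zero because the masking operation never passes an error on. This is the kind of difference that an equivalent transformation could, in principle, exploit.
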
Finally, we give two optimized fault tolerance algorithms that improve performance and reduce power dissipation at the same time.

Our major contributions can be summarized in five aspects, as follows:

1. We define the concepts of atomic data and computational relations to describe the relations between registers or storage units that are affected by computations in programs, and we set up the computational data flow model. We define an error probability function for atomic data and an error propagation probability function for computational relations, with which we set up the error flow model on top of the computational data flow model. The error flow model describes, in probabilistic terms, how errors propagate through computational relations. By analyzing the error flow model, we can compute the error probability of any register or other storage unit at any time. Together these form a theoretical framework for error flow analysis.

2. To measure a program's fault tolerance, we define the concept of fault tolerance capability based on error flow analysis and give an error flow analysis method that calculates the fault tolerance capability of any program. We also propose a method to improve a program's fault tolerance capability through error flow analysis and equivalent transformation, without any explicit redundancy. Finally, we apply error flow analysis to describe how to build a double-redundancy fault-tolerant system, and to describe the effect on such a system of improving the fault tolerance capability of a single program replica.

3. We propose the concept of the key subgraph of an error flow graph, which has a critical effect on a program's fault tolerance capability, and give methods to generate the key subgraph from key nodes or key paths. We also propose a partial redundancy fault tolerance algorithm that replicates only the key subgraph instead of the whole error flow graph.
Compared with EDDI, partial redundancy improves IPC by 10%, reduces execution time by 15%, and reduces power dissipation by 10%, at the cost of a very small loss of error coverage.

4. Based on error flow analysis, we propose an error flow compressing algorithm to reduce the number of branch instructions inserted by the EDDI algorithm, which have a large impact on performance and power dissipation. Compared with EDDI, the error flow compressing algorithm improves IPC by 12%, reduces execution time by 10%, and reduces power dissipation by 5%, at the cost of a very small increase in error latency.
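
EDDI (Error Detection by Duplicated Instructions), the baseline for both comparisons above, duplicates instructions on separate registers and inserts comparisons before results leave the computation. The sketch below illustrates that general idea at source level in Python rather than at the instruction level the technique actually operates on. The function `checked_sum` and its `flip_at` fault-injection parameter are hypothetical and serve only the illustration; they are not taken from the dissertation.

```python
def checked_sum(values, flip_at=None):
    """Sum `values` in two independent accumulators and compare them.

    flip_at: hypothetical fault injection. If set, bit 3 of the primary
             accumulator is flipped after processing that index, which
             simulates a transient hardware fault.
    """
    primary = 0   # original computation
    shadow = 0    # duplicated computation on separate state
    for i, v in enumerate(values):
        primary += v   # original instruction
        shadow += v    # duplicated instruction
        if i == flip_at:
            primary ^= 1 << 3   # simulated bit flip in the primary copy
    # Comparison inserted before the result is "stored" (returned).
    if primary != shadow:
        raise RuntimeError("mismatch between primary and shadow copies")
    return primary

fault_free = checked_sum([1, 2, 3])
```

Every duplicated operation and every inserted comparison costs time and energy. That is the overhead the two algorithms in the abstract aim to reduce: partial redundancy by duplicating only the key subgraph, and error flow compressing by inserting fewer comparison branches.
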
Keywords/Search Tags: SIHFT, COTS, Space Computer, Error Flow Model, Error Flow Analyses, Fault Tolerance Capability, Key Subgraph of Error Flow, Error Flow Compressing Algorithm, Partial Redundancy Algorithm, Non-redundancy Fault Tolerance Compilation