Program Oriented Soft Error Tolerance

Posted on: 2013-04-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Xiong
GTID: 1268330422974313
Subject: Computer Science and Technology
Abstract/Summary:
Society as a whole benefits from advances in semiconductor and fabrication technology. However, each new technology generation introduces new obstacles to sustaining the exponential growth in the number of transistors per chip. As semiconductor technology develops, transistors become smaller and their threshold voltages lower. These performance improvements come with tighter noise margins. Consequently, a computer system built from billions of modern transistors is more vulnerable to faults.

In particular, when high-energy particles strike the devices of a computer system, system reliability can be affected. Such strikes can cause transient faults in hardware devices, which are also called soft errors. A particle strike disturbs the state of transistors on the chip, so a soft error manifests as flipped bits in data stored in, or passing through, a hardware component. A subsequent write to the faulty component can restore it to the expected state, so soft errors do not wear out the device. As semiconductor technology advances, the threat that soft errors pose to system reliability has become an important problem. This dissertation is concerned with tolerating soft errors in order to improve system reliability.

Software-based soft error tolerance is regarded as efficient compared with hardware-based fault tolerance, which requires additional hardware. It is also independent of the hardware design, and is therefore regarded as a flexible way to protect computer systems from the effects of soft errors. Given these advantages, software-based soft error tolerance is an important research issue in the field of fault tolerance. Focusing on software-based methods, this dissertation studies two main topics: the analysis and estimation of the effects of soft errors on computer systems, and methods for tolerating soft errors. The first topic is the basis of the second. The contributions of this dissertation are as follows.

1. This dissertation analyzes the program error propagation caused by SEU (Single Event Upset) effects. We first analyze the effects of soft errors on the code and data of a program, and build a soft error occurrence model from this analysis. Based on that model, we analyze how program errors caused by soft errors propagate through a program. We also use a Petri net model to describe the system state transitions that occur during program error propagation, and combine it with the results of fault injection experiments to analyze those transitions. The Petri net simulation results and the fault injection results demonstrate the effects of program error propagation caused by soft errors, and provide a basis for designing more efficient soft error tolerance methods.

2. This dissertation proposes a system reliability model under the effects of soft errors. First, we give a method for estimating the Architectural Vulnerability Factor (AVF) of hardware components. In this method, we explore the effects of soft errors on program control flow and on data flow separately, then examine the role and related parameters of each hardware component during program execution, and from these estimate the AVF of the component. Furthermore, we propose a new approach to measuring system reliability under the effects of soft errors. The approach first considers the reliability of hardware components, and then system reliability, that is, the ability of the system to perform its required function. Based on the mechanism by which soft errors affect system reliability, software reliability is used to represent system reliability. By exploring how hardware state affects software reliability, we build a software reliability model under the effects of soft errors.

3. Against the background of the conflict between reliability and performance in SIHFT (Software-Implemented Hardware Fault Tolerance) systems, this dissertation analyzes the optimization of, and the trade-off between, SIHFT system reliability and performance. We analyze soft error masking at the program level and, based on this analysis, identify the program parts that can mask soft errors. In our optimized program redundancy, the program parts that can mask program errors need not be protected, so both the reliability and the performance of the SIHFT system can be optimized. For program redundancy that does lead to a conflict between reliability and performance, we discuss the trade-off for different application areas. Based on the different reliability and performance requirements of SIHFT systems, our discussion not only shows how to configure program redundancy to meet system requirements, but also gives an optimized solution to the trade-off.

4. This dissertation proposes a new approach to implementing SIHFT systems that balances system reliability and performance through partial software protection. The regions left unprotected are motivated by soft error masking at the software level: dead code, code with a low probability of being executed, and some partially dead code. For protected code, every data item is duplicated and every operation is performed twice to ensure that the data stored in memory are correct. The approach also ensures that every branch instruction jumps to the correct address by checking both the branch condition and the destination address. In addition, we propose an application-level data flow error recovery approach that combines checkpointing with instruction-level fault tolerance. When a soft error is detected in the system, the program state is restored from a previously saved state associated with the erroneous data.

5. Because dynamic soft error tolerance can cover more soft errors, this dissertation proposes an approach to analyzing dynamic software behavior under the effects of soft errors, and proposes and implements a new dynamic software-based approach to tolerating soft errors. In our analysis, the effects of hardware soft errors on instructions are propagated to the computed results of the functions in a program. Using these function-level results, we identify the instruction errors within a function that can cause an incorrect program outcome. Based on software behavior at these different levels, we build a model relating program characteristics to software reliability. The proposed tolerance approach protects the program at run time and is implemented with dynamic binary instrumentation. It dynamically protects those functions in a program that are easily affected by soft errors. As a result, our dynamic soft error tolerance approach can dynamically trade off system reliability against performance.
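
The protection scheme summarized in contribution 4 (duplicate every data item, perform every operation twice, compare before committing, and restore a saved state when a mismatch is detected) can be illustrated with a small sketch. This is not the dissertation's implementation, which works at the instruction level. It is a minimal Python model under stated assumptions: the function names `flip_bit` and `protected_sum`, the parameters `fault_at` and `fault_bit`, and the choice to checkpoint after every verified step are all hypothetical and chosen only for illustration.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a single event upset by flipping one bit of an integer."""
    return value ^ (1 << bit)


def protected_sum(values, fault_at=None, fault_bit=3):
    """Sum `values` using duplicated state, comparison, and rollback.

    fault_at: index of the iteration in which one transient bit flip is
    injected into the primary copy (None means a fault-free run).
    Returns (result, number_of_recoveries).
    """
    acc, acc_dup = 0, 0          # every data item is kept twice
    checkpoint = (0, 0)          # hypothetical checkpoint: (next index, accumulator)
    recoveries = 0
    pending_fault = fault_at
    i = 0
    while i < len(values):
        # Every operation is performed twice, once per copy.
        new = acc + values[i]
        new_dup = acc_dup + values[i]
        if pending_fault == i:
            # The fault is transient, so it is injected only once.
            new = flip_bit(new, fault_bit)
            pending_fault = None
        if new != new_dup:
            # Mismatch detected: restore the saved state and re-execute.
            i, acc = checkpoint
            acc_dup = acc
            recoveries += 1
            continue
        # The copies agree, so the result is committed and a new checkpoint saved.
        acc, acc_dup = new, new_dup
        i += 1
        checkpoint = (i, acc)
    return acc, recoveries
```

Calling `protected_sum([1, 2, 3, 4])` returns `(10, 0)`, while `protected_sum([1, 2, 3, 4], fault_at=2)` returns `(10, 1)`: the injected flip is detected by the comparison, the step is re-executed from the checkpoint, and the final result is unchanged. Checkpointing after every step keeps the sketch short; a real system would presumably checkpoint far less often to limit overhead.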
Keywords/Search Tags: Soft Errors, Single Event Upset, Reliability, Performance, Program Level, Instruction Level, Fault Tolerance System, Static Program Compiling, Dynamic Binary Instrumentation