Study And Implementation Of Fault Tolerance For OpenMP Programs

Posted on: 2011-08-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Y Fu
GTID: 1118330341451665
Subject: Computer system architecture
Abstract/Summary:
In recent years, the scale of high-performance computing systems has grown considerably, while the mean time between failures has dropped dramatically. Fault tolerance techniques must therefore be adopted so that high-performance computing applications can tolerate errors caused by hardware failures.

For a long time, shared memory architectures were not widely used because of their limited scalability, and fault tolerance for shared memory systems was consequently not well studied. Recently, however, as multi-core processors are used to build high-performance computing systems, more and more MPP systems have been built on SMP nodes. Research on fault tolerance for shared memory architectures is therefore significant.

OpenMP is the dominant parallel programming model for shared memory architectures. This work studies application-level fault tolerance technology and focuses on rollback-recovery fault tolerance schemes. The contributions are as follows:

1. An error spreading model is established to describe how errors caused by hardware failures spread during program execution. We propose the concept of an error spreading graph, from which we draw conclusions that guide the design of error detection and error recovery schemes.
2. A non-blocking application-level checkpointing approach for OpenMP is proposed. The approach takes an extended OpenMP parallel data flow analysis as its theoretical foundation and saves only the 'must-be-saved' variables in checkpoints, which lowers the overhead of saving and restoring checkpoints.
3. A novel fault tolerance scheme for OpenMP based on parallel recomputing, named PR-OMP, is proposed. Because errors tend to occur in only one or two threads during the execution of an OpenMP program, the error-free threads can be used to perform the recovery computation in parallel, which lowers the overhead of error recovery.
4. A redundancy-based fault tolerance approach for OpenMP, named TriThread, is proposed. For an OpenMP program, multiple copies of the computation are created dynamically, and their intermediate results are compared and voted on at certain points. The program can therefore tolerate errors without saving any computational state. Compared with checkpointing and PR-OMP, this approach has a clear advantage in scalability, and it can serve as a substitute when checkpointing and PR-OMP are no longer applicable because of their scalability limitations.
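The redundancy-and-voting idea in item 4 can be illustrated with a minimal sketch in C with OpenMP pragmas. This is not the TriThread implementation described in the dissertation; it only shows the general technique of running three copies of a computation and taking a majority vote on their results. The kernel sum_squares, the helper vote3, the wrapper redundant_sum_squares, and the inject_fault parameter are all hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel used for illustration: sum of squares of x[0..n-1]. */
static double sum_squares(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * x[i];
    return s;
}

/* Majority vote over three results.
 * Returns 0 and stores the winning value in *out when at least two
 * results agree; returns -1 when all three differ.
 * Exact comparison is acceptable here because every copy executes the
 * same deterministic sequence of floating-point operations. */
static int vote3(double a, double b, double c, double *out)
{
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1;
}

/* Run the kernel as three redundant copies and vote on the results.
 * inject_fault (an assumption of this sketch) selects one copy (0..2)
 * whose result is corrupted to simulate a hardware error; pass -1 to
 * inject no fault. */
static int redundant_sum_squares(const double *x, size_t n,
                                 int inject_fault, double *out)
{
    double r[3];
    assert(x != NULL && out != NULL);

    /* Each iteration is one redundant copy; with OpenMP enabled the
     * copies may run on different threads. Without OpenMP the pragma
     * is ignored and the copies run one after another. */
    #pragma omp parallel for
    for (int copy = 0; copy < 3; copy++) {
        r[copy] = sum_squares(x, n);
        if (copy == inject_fault)
            r[copy] += 1.0;  /* simulated corruption of this copy */
    }

    return vote3(r[0], r[1], r[2], out);
}
```

With a single corrupted copy, the two agreeing copies outvote it, so the caller still receives the correct result without any checkpoint having been saved. With GCC, the pragma takes effect when the code is compiled with -fopenmp.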
Keywords/Search Tags: Fault Tolerance, Checkpointing, OpenMP, Parallel Computing