
Fault-Tolerance Techniques Research For The Parallel CFD Application Software Framework

Posted on: 2015-12-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X G Ren
Full Text: PDF
GTID: 1108330509961068
Subject: Computer Science and Technology
Abstract/Summary:
With advances in manufacturing technology and the growing number of processes, the performance of high-performance computers (HPC) keeps improving, while the programming wall and the reliability wall have become increasingly serious and hamper the development of high-performance applications. To address the programming wall, researchers have proposed parallel CFD application development frameworks, which decouple the work of experts from different fields and improve the efficiency of parallel application development. Traditional hardware-based fault tolerance suffers from high cost and a lack of flexibility, whereas software-based fault tolerance can be realized at lower cost and with greater flexibility. Fault tolerance for a domain-oriented parallel software framework differs from both system-level and application-level fault tolerance: it combines the advantages of the two in terms of cost and user-friendliness, and offers new opportunities and tools for tolerating hardware faults. As a typical field of high-performance computing, CFD (Computational Fluid Dynamics) has mature domain-oriented parallel software frameworks, so research on fault tolerance within a parallel CFD application development framework is significant for promoting parallel CFD computation. Based on a parallel CFD application development framework, this thesis studies fault-tolerance techniques for such frameworks. We design and construct a software fault-tolerance architecture within the framework and, addressing the two key issues of error detection and error recovery, propose a series of fault-tolerance methods and optimization techniques. The main work and contributions of this thesis are as follows:

1. Build an error propagation model for parallel programs based on state transition graph (STG) theory and, on top of it, construct an application-level error propagation model that matches the characteristics of parallel CFD applications. (Chapter 2)

The propagation of hardware faults in parallel programs is the basis for research on software fault tolerance for hardware faults. We first propose a state transition graph theory based on program state traces, which abstracts the conflict, causal, and concurrency relations together with interaction behavior. Using this theory, we build abstract models of the serial program and of parallel communication. On these models we analyze how faults propagate in a parallel program, covering native errors, errors generated through data flow, errors generated through control flow, and errors produced by communication, and we propose the error propagation equation for serial programs together with the corresponding solution algorithm. By abstracting communication operations as expression statements, we model error propagation caused by parallel communication within the same theoretical framework as the serial program, and give the corresponding error propagation equation and solution algorithm, which let researchers obtain each error set through static analysis. Starting from the continuous and discrete models of parallel CFD applications, this thesis analyzes their core computations and characteristics, obtaining one computation model centered on difference operations and one centered on the computation stencil; the two are unified into a computation pattern whose core is the computation stencil.
Based on the computation stencil, we give the error propagation equation for the stencil and the corresponding solution algorithm for application-level error propagation in the CFD simulation process, which let researchers obtain information on how errors propagate through the key application-level data of CFD.

2. Propose a fault-tolerant architecture for the parallel CFD application framework, based on the existing parallel CFD application framework. (Chapter 3)

We design a fault-tolerant architecture for the parallel CFD application framework, based on the error propagation model for parallel programs, the application-level CFD error propagation model, and the parallel CFD application software framework. Drawing on the natural fault-tolerance basis and the requirements of CFD applications, we design a synchronous recovery scheme and an asynchronous recovery scheme within the framework. The key to the synchronous recovery scheme is using the periodic snapshot output as the checkpoint backup, while the asynchronous recovery scheme uses user-level sender-based message logging to solve the problem of repeating the communication of the failed process.

3. Propose GS-DMR, an error detection method for the computation stencil, building on the features of CFD applications based on the discrete model. (Chapter 4)

Based on the application-level error propagation model and the features of the parallel CFD application framework under the discrete model, we propose a Dual Modular Redundancy error detection method based on grid sampling (GS-DMR), which greatly reduces the cost of detecting soft errors in the computation stencil. Taking into account how soft errors propagate under grid sampling, we analyze how to obtain the optimal error detection interval, checkpointing interval, and sampling window size for GS-DMR, and derive a heuristic algorithm.
To address the detection blind spots caused by propagation delay in GS-DMR, we propose several strategies, including a hybrid error detection scheme, a hazard checkpoint scheme, and a multi-checkpoint scheme, so that a mixed error detection scheme can be chosen according to application requirements.

4. Propose asynchronous pipeline I/O (AP-IO) for hiding the periodic output cost in the CFD simulation framework. (Chapter 5)

We present an asynchronous pipeline I/O (AP-IO) optimization scheme for periodic snapshot output, based on asynchronous I/O and the characteristics of CFD applications. In AP-IO, dedicated background I/O processes or threads handle file writes in pipeline mode, so the write overhead can be hidden behind more computation than with classic asynchronous I/O. We design the AP-IO framework and implement it in OpenFOAM, providing CFD users with a user-friendly interface.
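The grid-sampling idea behind GS-DMR (contribution 3) can be illustrated with a toy example. The sketch below is not the thesis's algorithm: the 3-point averaging stencil, the function names, and the fixed sampling stride are all assumptions chosen for illustration. It recomputes only every `stride`-th grid point and compares the result with the primary computation, so the redundancy cost is roughly `1/stride` of full duplication.

```python
def stencil_point(u, i):
    """3-point averaging stencil at interior index i (assumed stand-in for a CFD kernel)."""
    return 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]


def step(u, fault_at=None):
    """Advance one time step; optionally corrupt one output value to mimic a soft error."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = stencil_point(u, i)
    if fault_at is not None:
        new[fault_at] += 1.0  # injected error (hypothetical fault model)
    return new


def sampled_dmr_check(u_old, u_new, stride, tol=1e-12):
    """Recompute every `stride`-th interior point and return the indices that disagree."""
    mismatches = []
    for i in range(1, len(u_old) - 1, stride):
        if abs(stencil_point(u_old, i) - u_new[i]) > tol:
            mismatches.append(i)
    return mismatches
```

With a stride of 3 the check visits indices 1, 4, 7, and so on. A fault injected at a sampled point is flagged immediately, while a fault at an unsampled point passes this simple check, which is one way to picture the detection blind spots the abstract mentions.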
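The AP-IO idea in contribution 4, a dedicated background writer that takes snapshots off the solver's critical path, might look roughly like the following. This is a minimal sketch under stated assumptions, not the OpenFOAM implementation: the class `PipelinedSnapshotWriter`, the file naming, the queue depth, and the toy "solver" loop are all hypothetical.

```python
import os
import queue
import tempfile
import threading


class PipelinedSnapshotWriter:
    """Hypothetical background writer: the solver enqueues snapshots and keeps computing."""

    def __init__(self, out_dir, depth=2):
        self.out_dir = out_dir
        self._queue = queue.Queue(maxsize=depth)  # bounded: limits buffered snapshots
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        # Runs in the background thread; writes snapshots in submission order.
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: no more snapshots
                break
            step, values = item
            path = os.path.join(self.out_dir, f"snapshot_{step:06d}.txt")
            with open(path, "w") as fh:
                fh.write("\n".join(repr(v) for v in values))

    def submit(self, step, values):
        # Copy the data so the solver may overwrite its buffer;
        # this call blocks only when the pipeline is already full.
        self._queue.put((step, list(values)))

    def close(self):
        # Flush remaining snapshots and stop the background thread.
        self._queue.put(None)
        self._thread.join()


def run_simulation(out_dir=None, steps=6, output_every=2):
    """Toy time loop: compute every step, hand a snapshot to the writer periodically."""
    out_dir = out_dir or tempfile.mkdtemp()
    writer = PipelinedSnapshotWriter(out_dir)
    field = [float(i) for i in range(8)]
    for step in range(1, steps + 1):
        field = [0.5 * v for v in field]  # stand-in for one solver time step
        if step % output_every == 0:
            writer.submit(step, field)
    writer.close()
    return sorted(os.listdir(out_dir))
```

The bounded queue is the design choice worth noting: it lets several time steps of computation overlap with an in-flight write while capping the memory held by pending snapshots.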
Keywords/Search Tags:Fault tolerant, Hardware fault, Fault propagation, Checkpointing, CFD, Parallel application, Development framework