
A User-Level Message Logging Protocol for OpenFOAM

Posted on: 2014-07-20   Degree: Master   Type: Thesis
Country: China   Candidate: X Y Liu
GTID: 2308330479979444   Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of computational fluid dynamics (CFD) requires high-performance parallel computing (HPC) systems. Unfortunately, the growth in size of HPC systems has been accompanied by an overall decrease in the mean time between failures (MTBF). To let large-scale parallel CFD programs survive crashes and to mitigate the reliability-wall effect in such systems, efficient and reliable fault-tolerance mechanisms are needed. OpenFOAM is a free, open-source CFD software package with a large user base across most areas of engineering and science. It provides an implicit, pressure-velocity, iterative solution framework, so users can implement physical models efficiently and flexibly by writing code that mimics the form of the partial differential equations at the topmost level of the software. The existing fault-tolerance method in OpenFOAM tolerates only fail-stop failures, and only under a synchronous offline scheme. In this thesis we propose a new message logging protocol that is seamlessly integrated into the OpenFOAM framework. The main contributions of our work are:

1. Designing and implementing a fault-tolerance framework in OpenFOAM (Chapter 2). By simply modifying a configuration file written in a near-natural-language form, users obtain fault tolerance automatically, without any additional programming burden.
2. Introducing asynchronous online recovery to OpenFOAM (Chapter 2). Based on a coordinated checkpointing mechanism and a user-level message logging protocol, we propose an asynchronous online recovery framework that restarts only the failed process, thus avoiding a global restoration.
3. Proposing a user-level message logging protocol that rethinks fault tolerance for collective communications (Chapter 3). The protocol lifts payload copying, failure handling and the recovery procedure to the level of user code, which gives it the following benefits:
• Recording the result of a collective communication into the sender's message logger as a whole reduces the fault-tolerance overhead of collective communications.
• Building the fault-tolerance layer on top of ULFM (User-Level Failure Mitigation) ensures a degree of portability.
• As long as the source code of the parallel program uses no any-source receptions and no non-deterministic probes, event logging can be safely disabled in our protocol.
4. NPB benchmarks and a molecular dynamics simulation demonstrate the validity and advantages of our method (Chapter 5). Experimental results on TH-1A show a substantial improvement in failure-free performance and a reduction in recovery time. This indicates that the proposed method is simple yet effective, and that it is particularly suited to programs that use collective communication intensively.
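The protocol ideas above can be illustrated with a toy, single-process simulation. This is a minimal sketch under stated assumptions, not the thesis's implementation. It makes no MPI or ULFM calls, the names `SenderLog`, `run` and `recover` are hypothetical, and the "solver" is a ring exchange followed by a sum-allreduce. It shows three ideas from the abstract. Senders keep copies of the payloads they send. Each collective's result is logged once as a whole. After a failure, only the failed rank re-executes from the last coordinated checkpoint, while survivors replay their logs to it. Because every receive names an explicit source and tag (no any-source receive, no non-deterministic probe), replaying each sender's log in send order reproduces the original delivery order, so no event log is kept.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SenderLog:
    """Hypothetical user-level, sender-based message log held by one rank."""
    rank: int
    # (dest, tag) -> payloads sent since the last checkpoint, in send order
    p2p: dict = field(default_factory=lambda: defaultdict(list))
    # collective sequence number -> result of that collective, stored once as a whole
    collectives: dict = field(default_factory=dict)

    def log_send(self, dest, tag, payload):
        self.p2p[(dest, tag)].append(payload)

    def log_collective(self, seq, result):
        self.collectives[seq] = result

    def truncate(self):
        # After a coordinated checkpoint, older entries can never be replayed.
        self.p2p.clear()
        self.collectives.clear()

    def replay_for(self, failed_rank):
        # Payloads this rank must resend to a restarted peer, grouped by tag.
        return {tag: list(msgs) for (dest, tag), msgs in self.p2p.items()
                if dest == failed_rank}


def run(n, fail_at, checkpoint_at):
    """Failure-free run of a toy ring + allreduce solver for fail_at iterations."""
    assert n >= 2 and 0 <= checkpoint_at < fail_at
    logs = [SenderLog(r) for r in range(n)]
    x = list(range(1, n + 1))
    ckpt = None
    for k in range(fail_at):
        if k == checkpoint_at:
            ckpt = list(x)  # coordinated checkpoint of every rank's state
            for log in logs:
                log.truncate()
        # Point-to-point ring exchange: rank i sends x[i] to rank (i + 1) % n, tag k.
        recv = [0] * n
        for i in range(n):
            dest = (i + 1) % n
            logs[i].log_send(dest, k, x[i])
            recv[dest] = x[i]
        x = [x[i] + recv[i] for i in range(n)]
        # Allreduce(sum): each rank logs only the final result, not internal messages.
        total = sum(x)
        for log in logs:
            log.log_collective(k, total)
        x = [xi + total for xi in x]
    return x, ckpt, logs


def recover(failed, n, ckpt, checkpoint_at, fail_at, logs):
    """Re-execute only the failed rank from its checkpoint, fed from survivors' logs."""
    left = (failed - 1) % n      # the only rank that sends to `failed` in this ring
    survivor = (failed + 1) % n  # any live rank holds the logged collective results
    replay = logs[left].replay_for(failed)
    x = ckpt[failed]
    for k in range(checkpoint_at, fail_at):
        # The restarted rank's own sends are suppressed: survivors already have them.
        # Explicit source and tag make replay order deterministic (no event log).
        (left_value,) = replay[k]
        x += left_value
        x += logs[survivor].collectives[k]
    return x
```

For example, after `final, ckpt, logs = run(n=4, fail_at=6, checkpoint_at=3)`, the call `recover(2, 4, ckpt, 3, 6, logs)` should equal `final[2]` for any choice of failed rank. Note that each allreduce costs one log entry per rank regardless of how the collective is implemented internally. A real implementation would additionally need duplicate suppression, coordinated log garbage collection, and a ULFM-based mechanism to detect the failure and reinsert the restarted process. Those parts are outside this sketch.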
Keywords/Search Tags: OpenFOAM, Message logging, User-level failure mitigation, Asynchronous online recovery