Dynamic Cluster Strategy For Hierarchical Rollback-Recovery Protocols

Posted on: 2017-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: B S Zhang
Full Text: PDF
GTID: 2348330503989882
Subject: Computer software and theory
Abstract/Summary:
High performance computing systems have expanded rapidly, reaching millions of processor cores over the past few years. As a result, the mean time between failures (MTBF) has dropped to hours, which is significantly shorter than the execution time of most current scientific applications, so more time is spent dealing with failures. Fault tolerance in parallel computing therefore becomes increasingly important. Hierarchical rollback-recovery protocols, which combine coordinated checkpointing with message logging, are a commonly used and effective fault tolerance mechanism for message passing applications.

However, a study of the communication mechanisms and communication modes of MPI shows that such protocols may not achieve the best efficiency, because the communication pattern of an application may change between its stages. To further improve the efficiency of hierarchical rollback-recovery protocols, a dynamic cluster strategy (DCS) is proposed that adapts to changes in the communication pattern by using a prediction scheme. In the prediction scheme, the application is partitioned into several parts, and the clusters obtained from prior parts are used in the succeeding part. Thus, DCS reduces the overhead caused by the migration of processes.

Detailed experiments are carried out on the High Performance Linpack benchmark to evaluate the efficiency and scalability of DCS, using two static process partition algorithms. A cost function is defined to evaluate the volume of computing resources wasted under DCS. The results show that DCS reduces the message logging size by about 24% to 45%. Moreover, DCS outperforms the static cluster strategy, with about 15% lower cost and better scalability.
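
The prediction scheme described above can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that only messages crossing cluster boundaries are logged (the usual arrangement in hierarchical protocols), that each part of the application is summarized by a table of bytes exchanged per process pair, and that the first part is clustered from its own profile. The partitioner `greedy_clusters`, the traffic tables in `PHASES`, and all other names are hypothetical stand-ins; the two static partition algorithms used in the thesis are not specified in the abstract, and the cost of migrating processes is ignored.

```python
def logged_bytes(traffic, cluster_of):
    """Bytes sent between processes that sit in different clusters."""
    return sum(b for (src, dst), b in traffic.items()
               if cluster_of[src] != cluster_of[dst])


def greedy_clusters(traffic, n_procs, max_size):
    """Hypothetical partitioner: merge the heaviest-communicating groups first."""
    cluster_of = list(range(n_procs))        # every process starts alone
    size = {c: 1 for c in range(n_procs)}
    weight = {}                              # undirected traffic per pair
    for (src, dst), b in traffic.items():
        key = (min(src, dst), max(src, dst))
        weight[key] = weight.get(key, 0) + b
    for (a, b), _ in sorted(weight.items(), key=lambda kv: -kv[1]):
        ca, cb = cluster_of[a], cluster_of[b]
        if ca != cb and size[ca] + size[cb] <= max_size:
            for p in range(n_procs):         # relabel cluster cb as ca
                if cluster_of[p] == cb:
                    cluster_of[p] = ca
            size[ca] += size.pop(cb)
    return cluster_of


def static_total(phases, n_procs, max_size):
    """Static strategy: cluster once, keep the clusters for every part."""
    clusters = greedy_clusters(phases[0], n_procs, max_size)
    return sum(logged_bytes(t, clusters) for t in phases)


def dynamic_total(phases, n_procs, max_size):
    """Dynamic strategy: each part runs with clusters from the prior part."""
    clusters = greedy_clusters(phases[0], n_procs, max_size)  # assumed start
    total = 0
    for traffic in phases:
        total += logged_bytes(traffic, clusters)
        clusters = greedy_clusters(traffic, n_procs, max_size)  # prediction
    return total


# Made-up traffic (bytes) for 4 processes; the pattern shifts after part 0.
PHASES = [
    {(0, 1): 100, (2, 3): 100, (0, 2): 10},
    {(0, 2): 100, (1, 3): 100, (0, 1): 10},
    {(0, 2): 100, (1, 3): 100, (0, 1): 10},
]

print("static :", static_total(PHASES, 4, 2))
print("dynamic:", dynamic_total(PHASES, 4, 2))
```

In this toy input the dynamic variant logs less than the static one because it mispredicts only the part in which the pattern changes. The numbers are illustrative and unrelated to the percentages reported in the abstract.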
Keywords/Search Tags: high performance computing, fault tolerance, message passing interface, rollback-recovery protocols