Research On Fault Recovery Technology For Parallel Iterative Calculation

Posted on: 2015-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: W Xiang
GTID: 2370330488499705
Subject: Software engineering
Abstract/Summary:
Iterative methods for solving linear systems are a core computational tool in engineering and scientific research. Today, iterative methods mainly run on distributed computing systems, where the probability of system failure increases as the number of nodes grows. System failures include both hard crashes and soft faults. The main online recovery technique for such failures is algorithm-based recovery, but its high overhead is a key obstacle to wider adoption. This thesis studies how to reduce the overhead of online recovery, and the research covers two aspects.

First, algorithm-based recovery cannot guarantee failure recovery for all nodes. To address this limitation and the high overhead of algorithm-based recovery, the thesis presents a failure recovery scheme based on information redundancy, built on top of algorithm-based recovery. The scheme transfers extra data to other nodes so that each node has multiple backups during the iterative calculation. When several nodes fail at the same time, the failed nodes can obtain correct data from other nodes. Although the scheme incurs extra overhead for data transfer, this overhead is negligible. The scheme also guarantees recovery when multiple nodes fail simultaneously, which avoids recomputation.

Second, algorithm-based recovery tolerates soft faults by means of error detection and checkpoints. Its fault-tolerance overhead is too large to guarantee real-time performance. Because error detection and checkpointing have different overheads, we treat the error detection interval and the checkpoint interval as variables, use a Markov chain to build a model that estimates the completion time of computing tasks in the presence of faults, and derive a formula for the optimal error detection and checkpoint intervals, thereby achieving failure recovery with low overhead. Experimental results demonstrate that the optimized algorithm-based recovery technique reduces the time to obtain a correct result by up to several orders of magnitude compared with traditional algorithm-based recovery.
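
To illustrate the idea of choosing a checkpoint interval that minimizes expected completion time, the following Python sketch uses a simplified textbook model. It is not the thesis's Markov-chain model or its formula, which the abstract does not spell out. The sketch assumes failures arrive as a Poisson process with a constant rate, that every failure is detected immediately (so the error detection interval is not modeled), and that no failure occurs during the restart itself. All function and parameter names are hypothetical. The numerically found interval is compared with Young's classical first-order approximation, sqrt(2 * checkpoint cost / failure rate).

```python
import math


def expected_completion_time(work, interval, ckpt_cost, restart_cost, fail_rate):
    """Expected wall-clock time to finish `work` seconds of computation.

    Assumed model (not taken from the thesis): the job is cut into segments
    of `interval` seconds, each followed by a checkpoint of `ckpt_cost`
    seconds. Failures are Poisson with rate `fail_rate`; after a failure the
    job pays `restart_cost` and repeats the current segment from its start.
    """
    segments = work / interval
    # Expected time to get through one segment plus its checkpoint:
    # (1/lambda + R) * (exp(lambda * (tau + C)) - 1)
    per_segment = (1.0 / fail_rate + restart_cost) * math.expm1(
        fail_rate * (interval + ckpt_cost)
    )
    return segments * per_segment


def best_interval(work, ckpt_cost, restart_cost, fail_rate, steps=2000):
    """Grid search for the interval with the smallest expected completion time."""
    candidates = [work * (i + 1) / steps for i in range(steps)]
    return min(
        candidates,
        key=lambda t: expected_completion_time(
            work, t, ckpt_cost, restart_cost, fail_rate
        ),
    )


def young_interval(ckpt_cost, fail_rate):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * ckpt_cost / fail_rate)


if __name__ == "__main__":
    # Hypothetical numbers: 10 h of work, 10 s checkpoints, 30 s restarts,
    # one failure per hour on average.
    work, ckpt, restart, rate = 36000.0, 10.0, 30.0, 1.0 / 3600.0
    tau = best_interval(work, ckpt, restart, rate)
    print("grid-search interval (s):", tau)
    print("Young approximation (s):", young_interval(ckpt, rate))
    print("expected time (s):", expected_completion_time(work, tau, ckpt, restart, rate))
```

Under this model, checkpointing too rarely makes each failure expensive, while checkpointing too often spends most of the run writing checkpoints; the minimum lies between the two. A model in the spirit of the thesis would add a second variable for the error detection interval.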
Keywords/Search Tags: Iterative Methods, Hard Crash, Soft Errors, Algorithm-Based Recovery