Research On Key Technologies Of Fault Tolerance Of Large Scale Distributed Simulation System

Posted on: 2007-07-15

Degree: Doctor

Type: Dissertation

Country: China

Candidate: Y S Liu

GTID: 1118360215470552

Subject: Control Science and Engineering

Abstract/Summary:

Fault tolerance in distributed systems is a challenging problem and an active research subject in distributed simulation. Implementing fault tolerance requires solving many theoretical problems, such as failure detection algorithms, state save and restore protocols, and fault-tolerant scheduling algorithms. Moreover, fault tolerance directly determines the reliability of a simulation system, so studying it in distributed simulation systems is of great theoretical and practical significance. Considering the characteristics of distributed simulation systems, this dissertation explores in depth the theoretical problems and the engineering practice that fault tolerance of distributed simulation systems requires; part of the engineering practice benefits from grid technology.

First of all, the framework of the Distributed Simulation Fault Tolerant System (DS-FTS) was designed. We analyzed how grid technology can remedy the drawbacks of distributed simulation systems and decided to realize DS-FTS on the basis of grid technology. Further, we analyzed the characteristics of fault tolerance in distributed simulation systems and presented the idea of fault tolerance throughout the simulation system life cycle. We analyzed the possible faults in a simulation system, established the fault-tolerance level of DS-FTS, and studied fault-tolerant design patterns for simulation systems. We also examined the relationship between DS-FTS and the simulation system, designed the hierarchy and functions of DS-FTS, and discussed the key technologies that need to be solved.

Failure detection is the basis of fault tolerance. Its performance is affected by the timing characteristics of the system model. We analyzed the timing characteristics of large-scale distributed simulation systems and decided to describe them with a partially synchronous model.
Based on the characteristics of the multi-federation architecture of HLA simulation systems, we proposed a distributed, hierarchical, system-level failure detection strategy and a corresponding algorithm, Hi-UA-DSD. In this algorithm, simulation nodes are grouped into different sub-testing rings, and failure detection is divided into intra-ring and inter-ring detection. The former is based on the UA-DSD algorithm and the latter on the UA-DSD-Int algorithm. The correctness proof and evaluation of the algorithm show that, compared with other algorithms, Hi-UA-DSD has higher accuracy, less network overhead, lower diagnosis latency and better scalability. The algorithm is suitable for failure detection in HLA simulation systems as well as in other large-scale distributed simulation systems. For small-scale simulation systems, UA-DSD can be used instead.

The federation save protocol provides system state data for fault tolerance. IEEE 1516-2000 adopts a blocking federation save protocol, which imposes a large overhead on the simulation system. Based on an analysis of the factors that affect the component states of a simulation system during federation save, we proposed a non-blocking federation save protocol, CICCP. It avoids time inconsistency between the saved RTI state and federate state caused by time advance, state inconsistency among saved federate states caused by message transfer, and the problem of in-transit messages. The overhead of this protocol is fairly small. In addition, to guarantee the consistency of federation restore, we proposed two measures. For general simulation systems, a crossed time advance method maintains consistency when a federate issues a time advance request with zero lookahead. For real-time distributed simulation systems, we partly extended the HLA OMT (Object Model Template) and the HLA transmission services, based on network QoS (Quality of Service) technology, to guarantee the repeatability of the network.
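The ring grouping behind the hierarchical failure detection described above can be illustrated with a minimal Python sketch. It is not UA-DSD, UA-DSD-Int or Hi-UA-DSD, whose details the abstract does not give; it assumes a plain heartbeat timeout inside each ring and a hypothetical "representative" (the first unsuspected node) that stands for the ring at the inter-ring level. All function names, the ring size and the timeout are assumptions made for illustration.

```python
def group_into_rings(nodes, ring_size):
    """Split the node list into consecutive sub-rings of at most ring_size nodes."""
    return [nodes[i:i + ring_size] for i in range(0, len(nodes), ring_size)]


def intra_ring_suspects(ring, last_heartbeat, now, timeout):
    """Intra-ring step (assumed rule): suspect every node whose last
    heartbeat is older than `timeout`, or that never sent one."""
    return {
        node for node in ring
        if now - last_heartbeat.get(node, float("-inf")) > timeout
    }


def inter_ring_view(rings, last_heartbeat, now, timeout):
    """Inter-ring step (assumed rule): each ring is represented by its first
    unsuspected node; a ring with no representative is entirely suspected."""
    view = {}
    for index, ring in enumerate(rings):
        suspects = intra_ring_suspects(ring, last_heartbeat, now, timeout)
        alive = [node for node in ring if node not in suspects]
        view[index] = (alive[0] if alive else None, suspects)
    return view


# Hypothetical example: six nodes, two rings of three, timeout of 2 time units.
nodes = ["n0", "n1", "n2", "n3", "n4", "n5"]
heartbeats = {"n0": 9.5, "n1": 4.0, "n2": 9.8, "n3": 2.0, "n4": 1.0, "n5": 3.0}
rings = group_into_rings(nodes, ring_size=3)
view = inter_ring_view(rings, heartbeats, now=10.0, timeout=2.0)
```

With this grouping, heartbeat traffic stays inside each ring and only one summary per ring crosses ring boundaries, which is the kind of overhead reduction a hierarchical scheme aims for.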
The CICCP protocol and the consistency methods for federation restore can also be applied to the same kinds of problems in other large-scale distributed simulation systems.

Different storage strategies for checkpoint files incur different costs, and executing the above failure detection algorithm and federation save protocol adds overhead to the simulation system. We set up a Markov chain model of the simulation system and studied these problems with the aim of maximizing system availability. As a result, we proposed a suitable storage strategy and formulas for computing the best heartbeat and checkpoint intervals. This part complements the two preceding parts.

The fault-tolerant scheduling algorithm is the final embodiment of fault tolerance, which both the failure detection algorithm and the federation save protocol serve. First, following the idea of ICM, we put forward a fault-tolerant scheduling algorithm framework, ICM-FTSA. Then, based on two types of fault-tolerant models, we proposed two kinds of heterogeneous fault-tolerant scheduling algorithms: CSP-RTFT and MW-RTFT/RC-RTFT. CSP-RTFT is based on an improved Spare Processor model (CSP), while MW-RTFT and RC-RTFT are based on the Primary-Backup model. In MW-RTFT, primary version tasks are scheduled by the minimum Worst Case Response Time (WCRT) heuristic rule, whereas in RC-RTFT they are scheduled by combined rules of minimum reliability cost and minimum WCRT. In both algorithms, backup version tasks are scheduled by the minimum WCRT heuristic rule. These two kinds of scheduling algorithms can be embedded into ICM-FTSA to produce more flexible algorithms. Simulation results show that they can meet the fault-tolerant scheduling requirements of simulation systems under different situations.

For the engineering practice, based on the DS-FTS framework and the above algorithms and protocols, we implemented the failure detection module, the system state saving and restoring module, and the fault-tolerant scheduling module of DS-FTS.
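The trade-off behind choosing a checkpoint interval can be made concrete with a short sketch. The Markov-chain formulas of the dissertation are not given in this abstract, so the code below substitutes Young's classic first-order approximation, in which the fraction of time lost is roughly C/T + T/(2M) for checkpoint cost C, interval T and mean time between failures M, minimized at T = sqrt(2CM). The function names and the numbers in the example are assumptions.

```python
import math


def young_checkpoint_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the best checkpoint interval.
    This is a stand-in, not the Markov-chain result of the dissertation."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost: time spent writing checkpoints
    (C / T) plus expected rework after a failure (T / (2 * M))."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical numbers: a checkpoint takes 2 s, failures occur every hour on average.
best = young_checkpoint_interval(checkpoint_cost=2.0, mtbf=3600.0)
loss_at_best = overhead_fraction(best, 2.0, 3600.0)
```

Checkpointing more often than `best` wastes time on saving state, while checkpointing less often wastes time on recomputation after a failure; a heartbeat interval involves a similar balance between detection latency and network load.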
In practice, we redesigned and implemented an HLA-based battle simulation system following the idea of fault tolerance throughout the simulation system life cycle, and used DS-FTS to support it. The results show that this idea, combined with DS-FTS, can solve the fault-tolerance problems of large-scale distributed simulation systems.
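The Primary-Backup idea used by the scheduling algorithms above can be sketched as a greedy placement. This is not MW-RTFT or RC-RTFT: the abstract does not define how WCRT is computed, so the sketch assumes a much simpler estimate (the time at which a processor would finish the task) and conservatively reserves processor time for backups as well. The function name and the task table are hypothetical.

```python
def schedule_primary_backup(exec_time):
    """Greedy primary-backup placement on heterogeneous processors.

    exec_time maps task -> {processor: execution time}. Each task gets a
    primary copy on the processor with the smallest estimated finish time
    and a backup copy on the best *different* processor, so one processor
    failure cannot lose both copies.
    """
    processors = sorted({p for costs in exec_time.values() for p in costs})
    busy_until = {p: 0.0 for p in processors}
    placement = {}
    for task, costs in exec_time.items():
        if len(costs) < 2:
            raise ValueError(f"task {task!r} needs at least two candidate processors")

        def finish(p):
            # Estimated finish time if this copy were appended to processor p.
            return busy_until[p] + costs[p]

        primary = min(costs, key=finish)
        busy_until[primary] = finish(primary)
        backup = min((p for p in costs if p != primary), key=finish)
        busy_until[backup] = finish(backup)
        placement[task] = (primary, backup)
    return placement, busy_until


# Hypothetical task table: execution times differ per processor.
tasks = {
    "t1": {"p1": 4, "p2": 6, "p3": 5},
    "t2": {"p1": 3, "p2": 2, "p3": 4},
}
placement, busy_until = schedule_primary_backup(tasks)
```

A rule in the spirit of RC-RTFT would replace the `finish` key with one that also weighs a reliability cost per processor, which this sketch leaves out.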

Keywords/Search Tags:

Large-scale Distributed Simulation System, Fault Tolerance, Failure Detection, System State Save Protocol, Heartbeat and Checkpoint Interval, Recovery Strategy, Fault-tolerant Scheduling

Related items

1	A Checkpoint-Based Fault-Tolerant Service In Distributed Systems
2	Study On Backward Recovery Of Fault Tolerant Technology In Distributed Systems
3	Research On The Task Fault-tolerant Scheduling Optimization Algorithms For The Distributed Real-Time System
4	Distributed File System Level Fault-tolerant Mechanism
5	Research On Fault Tolerance In Distributed Stream Data Processing
6	Study Of Fault Tolerance Checkpoint Algorithm In Distributed System And Software Design
7	The Research And Implementation Of Checkpoint Technology Based On WinNT
8	Research And Implementation Of Fault Recovery Mechanism In Large-scale Graph Processing System Based On BSP
9	Research On Fast Fault Tolerance Mechanism For Single Point Of Failure In Stream Computing Environment
10	Research Of Task Recovery Strategy Based On Checkpoint In MapReduce