As distributed systems grow in scale and complexity, reliability becomes both more important and harder to achieve. Reliability is especially critical in fields such as aerospace, the military, finance, and medicine. Failure detection is one of the fundamental components of a reliable distributed system, and because systems are expanding rapidly in scale, the adaptability and scalability of failure detection have become very important. This paper studies several critical problems related to these two attributes. It first studies an adaptive failure detector that can serve multiple application requirements; then, for the hierarchical detection method, it studies the availability of the Leader node and the detection of faulty links. The results can be applied to the design of hierarchical detection and provide theoretical support for failure detection mechanisms in large-scale distributed systems.
A distributed system hosts a large number of applications with different QoS (Quality of Service) requirements for failure detection. To remain efficient and scalable, a failure detector should therefore provide accurate failure detection at the QoS each application requires, while avoiding the redundant load of multiple detectors designed for different QoS. To this end, a new failure detector, QA-FD (an adaptive failure detector based on QoS), is presented, which adopts a heartbeat detection strategy based on the PULL mode. QA-FD provides failure detection QoS for multiple applications according to quantitative QoS metrics (TDU, TMRL, TUM) and requires no assumptions about message behavior or clock synchronization. In addition, QA-FD is proved to implement a failure detector of class ◇P in the partially synchronous model, and experimental results are given.
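To make the idea of one PULL-mode detector serving several applications concrete, the following is a minimal sketch and not the QA-FD algorithm itself. It assumes a hypothetical `probe` callable that sends an "are you alive?" request and reports whether a reply arrived, and it uses a fixed per-application timeout as a stand-in for the parameters QA-FD would derive from each application's QoS metrics. The names `PullDetector`, `register`, `poll`, and `suspects` are illustrative only.

```python
class PullDetector:
    """One shared PULL-mode detector queried by several applications.

    Assumption: each application is represented only by a timeout value;
    how QA-FD actually derives its parameters from QoS is not shown here.
    """

    def __init__(self, probe):
        self.probe = probe        # hypothetical: returns True if a reply arrived
        self.last_reply = 0.0     # time of the most recent reply
        self.timeouts = {}        # application name -> timeout (seconds)

    def register(self, app, timeout):
        """Record the timeout that stands in for this application's QoS."""
        self.timeouts[app] = timeout

    def poll(self, now):
        """Send one request; a reply refreshes the shared freshness time."""
        if self.probe():
            self.last_reply = now

    def suspects(self, app, now):
        """An application suspects the process once its own timeout elapses."""
        return now - self.last_reply > self.timeouts[app]


# Usage: replies stop after the second request.
replies = iter([True, True, False, False])
detector = PullDetector(lambda: next(replies))
detector.register("fast", 1.5)   # wants quick detection
detector.register("slow", 3.5)   # tolerates slower detection, fewer mistakes
for t in (1.0, 2.0, 3.0, 4.0):
    detector.poll(t)
print(detector.suspects("fast", 4.0), detector.suspects("slow", 4.0))
```

Only one stream of requests is sent regardless of how many applications register, which is the load-saving point made above; the applications differ only in how they interpret the shared reply history.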
Hierarchical detection is an important way to greatly reduce detection cost and effectively improve system scalability. However, it suffers from a single point of failure in each group, so the Leader node of each group becomes the reliability bottleneck of the system. A solution for a highly available Leader node is proposed, which adopts a dual fault-tolerance mechanism based on arbitration. First, a Markov model is used to analyze the dual fault-tolerant system, showing that a failure detection mechanism with high coverage and a high success rate has an important influence on system availability. To address the low success rate of failure detection in traditional systems, an arbitration-based detection mechanism is proposed. The Arbitrator achieves very high reliability through its fault-tolerant design; when the dual nodes cannot reach a correct decision about a failure, the Arbitrator, as a dependable third party, can locate the fault exactly, which effectively improves the success rate of failure detection. On this basis, by combining self-detection and heartbeat detection, a multi-layer failure detection mechanism is presented and applied to the design of a practical system to achieve a highly available Leader node. Experiments show that the solution satisfies the availability requirement.
In most failure detection algorithms for distributed systems, the failure model is restricted to process failures, and link failures are either simply masked or modeled as process failures. Both approaches can quickly exhaust system resources and potentially reduce system availability. A failure Detection Protocol based on Heartbeat of multiple Master-nodes (DPHM) is proposed, which can promptly and accurately detect and locate faulty links by adopting a voting mechanism among master-nodes.
In addition, DPHM can elect a new master-node, which further extends the continuous working time and improves the availability of the system.
The Byzantine failure model is the most severe, and masking or detecting Byzantine failures is very expensive. However, it must be handled in systems that require extremely high reliability. Thus, this paper studies failure detection in the presence of both Byzantine links and Byzantine nodes. Because the behavior of a Byzantine component is arbitrary, the Invalid Link model is proposed first; it describes more accurately the effect of Byzantine behavior when both links and nodes may fail, and it improves failure detection coverage. Based on the Invalid Link model, an evidence-based failure detection protocol, PLFDA, is presented, which can detect Byzantine processors and links simultaneously. Finally, proofs of the correctness and complexity of PLFDA are given, together with experimental results.