| With the integration of high-performance computing and Internet technology, Grid systems have developed into an infrastructure for distributed, heterogeneous, and dynamic environments. They connect many kinds of resources above the application layer, provide a seamless, reliable, and unified service access interface, and achieve transparent access control to hardware, software, data, storage, and other resources. Nevertheless, Grid systems are prone to failures because of their highly dynamic and heterogeneous nature, and the frequent occurrence of failures has become a major problem for scientists, engineers, and users. How to build an appropriate fault-tolerance mechanism that improves the performance of fault detection and handling, and thereby ensures the reliability and stability of the Grid, is one of the most difficult issues in Grid systems. Based on comparative research, this dissertation summarized the fault-tolerance requirements of Grid systems and constructed a dynamic fault-tolerance management strategy. It then presented a corresponding dynamic fault detection algorithm and a QoS-restricted fault handling service selection algorithm; finally, it implemented a task-level fault-tolerance service system for users on top of CGSP. The main research contents are as follows. ① According to the characteristics of the Grid, the special fault-tolerance requirements of the Grid environment were summarized. The author constructed a fault-tolerance architecture comprising a fault detection module, a fault handling module, and a request proxy module, and then described the running process of the model. ② To address the problem that existing fault detection algorithms cannot satisfy the requirements of multi-process fault detection in Grid systems, a dynamic and scalable fault detection algorithm was presented.
The author established a small-world-based Grid system model and a fault detection model. By combining the unreliable fault detection method with a heartbeat strategy and a grey prediction model, the author designed a dynamic heartbeat mechanism and presented the dynamic and scalable fault detection algorithm. A hierarchical architecture of fault detectors was introduced, and the accuracy, completeness, and reliability of the algorithm were analyzed. Finally, experimental results demonstrated that the algorithm is valid and effective and can be used for fault detection in Grid environments. ③ To address the problem of how to select a fault handling service for different Grid applications, the author put forward a QoS-restricted fault handling service selection algorithm. Based on an analysis of the background and requirements of fault handling, formal definitions of several common fault handling techniques were proposed, and a scalable QoS-restricted fault handling model was constructed. The QoS-restricted decision problem was abstracted as a multi-attribute decision problem, and an information-entropy decision method was constructed. The QoS-restricted fault handling service selection algorithm was then put forward, and its correctness and effectiveness were demonstrated by simulation. ④ Building on the research on fault detection and handling, the author proposed the design and implementation of a fault-tolerance management service. The architecture and management process of the CGSP platform were introduced; the design principles of the fault-tolerance management service were put forward; and core system services such as the request proxy service, the fault detection service, and the fault handling service were designed and implemented.
Finally, the effectiveness of the fault-tolerance mechanisms was demonstrated in a CGSP experimental environment. To sum up, according to the fault-tolerance requirements of Grid services, this dissertation proposed a suite of solutions comprising a dynamic fault-tolerance strategy, fault detection, and fault handling. Theoretical analysis and simulation show that the strategy and the algorithms are correct and effective, can be used for fault detection and handling in Grid environments, and improve the reliability and stability of Grid systems. |
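
The information-entropy decision step summarized in ③ can be illustrated with the standard entropy weight method for multi-attribute decisions. The abstract does not give the dissertation's actual formulation, so the following is only a minimal sketch under stated assumptions: the function names, the choice of QoS attributes, the candidate services, and every number are hypothetical, and all attribute values are assumed to be strictly positive.

```python
import math


def normalize(matrix, is_benefit):
    """Scale each attribute into (0, 1]; larger is always better afterwards.

    Benefit attributes use x / max, cost attributes use min / x.
    Assumes strictly positive values (an assumption of this sketch).
    """
    columns = list(zip(*matrix))
    return [
        [
            row[j] / max(columns[j]) if is_benefit[j] else min(columns[j]) / row[j]
            for j in range(len(row))
        ]
        for row in matrix
    ]


def entropy_weights(matrix):
    """Return one weight per attribute using the entropy weight method.

    An attribute whose values barely differ between candidates has entropy
    close to 1 and therefore receives a weight close to 0.
    """
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)  # scales each column's entropy into [0, 1]
    diversities = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:
                entropy -= k * p * math.log(p)
        # Clamp tiny negative values caused by floating-point rounding.
        diversities.append(max(0.0, 1.0 - entropy))
    spread = sum(diversities)
    if spread < 1e-12:  # all attributes uninformative: fall back to equal weights
        return [1.0 / n] * n
    return [d / spread for d in diversities]


def select_service(names, matrix, is_benefit):
    """Pick the candidate with the highest entropy-weighted score."""
    norm = normalize(matrix, is_benefit)
    weights = entropy_weights(norm)
    scores = [sum(w * v for w, v in zip(weights, row)) for row in norm]
    best = max(range(len(names)), key=scores.__getitem__)
    return names[best], weights, scores


if __name__ == "__main__":
    # Hypothetical candidates and QoS values:
    # columns are reliability (benefit), recovery time in s (cost), price (cost).
    names = ["retry", "checkpoint", "replication"]
    qos = [
        [0.90, 8.0, 1.0],
        [0.95, 5.0, 2.0],
        [0.99, 2.0, 3.0],
    ]
    choice, weights, scores = select_service(names, qos, [True, False, False])
    print(choice, [round(w, 3) for w in weights], [round(s, 3) for s in scores])
```

Deriving the weights from the data means that an attribute on which all candidate services look alike contributes almost nothing to the ranking. A QoS restriction, such as a maximum recovery time, could be applied by filtering out infeasible candidates before calling `select_service`.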