Research On Reliability Analysis And Optimization Method For Heterogeneous Distributed Computing Systems

Posted on: 2017-05-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 1108330491964052
Subject: Computer Science and Technology
Abstract/Summary:
With the growing popularity of large-scale scientific applications and the increasing scale of parallel data processing, Distributed Computing Systems (DCSs), which are based on grid and parallel computing, have become an important direction of development for information and communication technology. DCSs have been widely adopted and have drawn great attention from both academia and industry, especially as data storage and computing platforms built from large numbers of cheap, heterogeneous computing units. Heterogeneous distributed computing platforms have become an important component of China's strategic emerging industries, and the optimization of system performance and reliability has become a research hotspot at home and abroad. As system size increases, applications demand long-term reliable operation of Heterogeneous Distributed Computing Systems (HDCSs). In addition, computing resources in heterogeneous systems can join and leave dynamically, so factors such as changes in input parameters and in the heterogeneous system environment expose parallel applications to uncertain and uncontrolled security threats. The reliable execution of each parallel task is therefore a key indicator of the quality of a DCS. In particular, when an HDCS is subject to correlated failures, the question is how to analyze system reliability and how to optimize the reliability of parallel applications through task scheduling.

This dissertation starts from the problem of modeling and assessing system reliability, focusing on the impact of stochastic, correlated failures on the service reliability of applications running on DCSs. System resource management and task allocation are studied in depth.
A series of task scheduling theories and reliability optimization methods are proposed under various constraints, such as performance and reliability costs, deadline constraints, and heterogeneous resources in the presence of correlated failures. The goal is to solve theoretical and technical reliability problems in the field of heterogeneous distributed computing systems. The main contributions of this dissertation are summarized as follows.

(1) The dissertation extends existing reliability analysis methods for distributed computing systems and proposes an approach to model and assess the impact of stochastic, correlated failures on the service reliability of applications running on DCSs. Heterogeneous distributed computing systems provide large-scale resource collaboration, wide-area communication and data sharing. However, many traditional reliability analysis methods assume that failures of computing resources are independent, and they do not consider temporally and spatially correlated failures in large-scale distributed systems. In particular, with the rapid development of nanoscale large-scale integrated circuits, the probability of correlated failures caused by high-energy electromagnetic radiation is growing rapidly. A failure model of the distributed computing system is established according to the failure characteristics of its resources. The dissertation proposes a reliability measure based on Taylor expansion under correlated failures, and then analyzes the impact of Common Cause Failure (CCF) on system reliability from the perspectives of system and network architecture. Simulation results reveal the main factors affecting system reliability and Mean Time Between Failures (MTBF).
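To make the effect of common cause failures concrete, the following is a minimal sketch that uses the standard beta-factor model for a 1-out-of-n redundant system. It is not the Taylor-expansion measure proposed in the dissertation, which the abstract does not spell out. The function names, the exponential failure assumption, and the parameters `lam` (per-node failure rate) and `beta` (fraction of that rate attributed to common causes) are all assumptions made for illustration.

```python
import math


def parallel_reliability(t, n, lam, beta):
    """Reliability at time t of a 1-out-of-n parallel system (beta-factor model).

    Assumption: each node fails at constant rate lam; a fraction beta of that
    rate is a common-cause shock that takes down all n nodes at once, and the
    remaining (1 - beta) * lam acts on each node independently.
    """
    r_independent = math.exp(-(1.0 - beta) * lam * t)
    r_common = math.exp(-beta * lam * t)
    # The system survives if no common-cause shock occurred AND
    # at least one node survived its independent failures.
    return r_common * (1.0 - (1.0 - r_independent) ** n)


def mean_time_to_failure(n, lam, beta, steps=20000):
    """Integrate R(t) numerically (trapezoidal rule) to get the mean time to failure."""
    horizon = 40.0 / lam  # beyond this point R(t) is negligible
    dt = horizon / steps
    total = 0.5 * (parallel_reliability(0.0, n, lam, beta)
                   + parallel_reliability(horizon, n, lam, beta))
    for i in range(1, steps):
        total += parallel_reliability(i * dt, n, lam, beta)
    return total * dt


if __name__ == "__main__":
    for beta in (0.0, 0.1, 0.5, 1.0):
        print(beta, round(mean_time_to_failure(n=2, lam=0.01, beta=beta), 2))
```

Under these assumptions, the benefit of adding a second node shrinks as `beta` grows, and at `beta = 1` redundancy gives no gain at all, which is why an independence assumption can overestimate reliability.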
On this basis, the dissertation further proposes reliability theory and algorithms for redundant systems and static systems, and verifies their performance and effectiveness through simulation.

(2) In view of the heterogeneous, dynamic and wide-area characteristics of distributed computing systems, the dissertation proposes a list scheduling algorithm that considers both task execution time and reliability cost. It tackles the problem of analyzing and selecting the most reliable communication links between dependent tasks in an Arbitrary Processor Network (APN). On this basis, by adding a prediction function for selecting distributed computing nodes, a task priority list that accounts for heterogeneity and reliability cost in the target HDCS is determined. Then, a list scheduling algorithm with a duplication strategy, called the Reliability-Driven Lookahead Scheduling algorithm (RDLS), is presented. Simulation results show that the proposed algorithm performs better than Heterogeneous Earliest Finish Time (HEFT) and Reliability-Aware Scheduling algorithm with Duplication (RASD) while maintaining the same time complexity.

(3) Based on the Failure Trace Archive (FTA), an online, public repository of failure traces collected from diverse parallel and distributed systems, the dissertation proposes a model of spatially and temporally correlated failures in HDCSs; the spatial correlation model covers both physical and logical network topologies. By the Hammersley-Clifford theorem, Markov random fields (MRFs) are equivalent to certain types of Gibbs distributions. The aim of correlated failure modeling is to group distributed computing nodes whose failures are correlated, thereby providing a basis for selecting redundant nodes that increase the reliability of task execution while avoiding the selection of multiple computing nodes from the same correlated failure group.
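The grouping idea can be illustrated with a much simpler heuristic than the MRF-based model described above. The sketch below treats two nodes as correlated when they fail close together in time often enough, then merges correlated pairs into groups with a union-find structure. The trace format `(node, timestamp)`, the function name, and the `window` and `min_cooccurrences` thresholds are hypothetical; they are not taken from the dissertation or from the FTA format.

```python
from collections import defaultdict
from itertools import combinations


def correlated_groups(failures, window, min_cooccurrences=2):
    """Group nodes whose failures repeatedly occur within `window` of each other.

    failures: iterable of (node, timestamp) pairs (hypothetical trace format).
    Returns a list of sets of node names; nodes absent from the trace are not listed.
    """
    times = defaultdict(list)
    for node, ts in failures:
        times[node].append(ts)

    parent = {node: node for node in times}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sorted(times), 2):
        # Count failures of a that have a failure of b close by in time.
        hits = sum(
            1 for ta in times[a]
            if any(abs(ta - tb) <= window for tb in times[b])
        )
        if hits >= min_cooccurrences:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in times:
        groups[find(node)].add(node)
    return sorted(groups.values(), key=sorted)


if __name__ == "__main__":
    trace = [("n1", 10), ("n2", 11), ("n1", 50), ("n2", 52),
             ("n3", 200), ("n3", 203), ("n4", 500), ("n4", 900)]
    print(correlated_groups(trace, window=5))
```

A scheduler could then treat each returned set as one failure domain and place at most one copy of a task in it.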
The effectiveness and feasibility of the proposed failure model are analyzed theoretically and evaluated experimentally.

(4) For distributed computing systems in the presence of correlated failures and Directed Acyclic Graph (DAG) applications with deadline constraints, the dissertation proposes a critical path model based on the deadline of each task, together with a Sub-Deadline Allocation (SDA) algorithm for parallel DAG applications. On this basis, the Reliability-Driven Greedy Duplication (RDGD) algorithm and the Cost-Driven Duplication (CDD) algorithm are also proposed. When selecting nodes for duplicates to improve the reliability of a DAG application, a scheduling algorithm should avoid assigning the same task to multiple nodes within one Shared Risk Resource Group (SRRG). Performance evaluation experiments demonstrate that the critical path model based on sub-deadlines, together with scheduling algorithms that optimize for different objectives, not only improves the reliability of parallel DAG applications but also meets their performance requirements.
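The SRRG constraint on duplication can be sketched as a small greedy selection. This is not the RDGD or CDD algorithm itself, whose details the abstract does not give; it only illustrates the rule that copies of one task should land in different risk groups. The candidate format `(node, srrg_id, reliability)`, the function name, and the assumption that nodes in different SRRGs fail independently are all hypothetical.

```python
def pick_replicas(candidates, target_reliability, max_replicas):
    """Greedily choose nodes for copies of one task, at most one per SRRG.

    candidates: list of (node, srrg_id, reliability) tuples, where reliability
    is the probability that the task completes on that node (hypothetical input).
    Assumption: nodes in different SRRGs fail independently, so the task fails
    only if every chosen copy fails.
    Returns (chosen_nodes, achieved_reliability).
    """
    chosen = []
    used_groups = set()
    failure_prob = 1.0  # probability that all chosen copies fail

    # Consider the most reliable nodes first.
    for node, group, rel in sorted(candidates, key=lambda c: c[2], reverse=True):
        if len(chosen) >= max_replicas or 1.0 - failure_prob >= target_reliability:
            break
        if group in used_groups:
            continue  # a copy already runs in this shared-risk group
        chosen.append(node)
        used_groups.add(group)
        failure_prob *= 1.0 - rel

    return chosen, 1.0 - failure_prob


if __name__ == "__main__":
    nodes = [("a", "g1", 0.95), ("b", "g1", 0.94), ("c", "g2", 0.90), ("d", "g3", 0.80)]
    print(pick_replicas(nodes, target_reliability=0.99, max_replicas=3))
```

In the example, node `b` is skipped even though it is the second most reliable candidate, because it shares a risk group with `a`; the second copy goes to `c` instead.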
Keywords/Search Tags: heterogeneous distributed computing system, correlated failure, reliability, DAG task scheduling, optimization design, deadline constraints