Research On The Key Technology Of Fault Tolerance Based On Fault Data Preprocessing For Supercomputing Systems

Posted on:2020-05-30

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H Huang

Full Text:PDF

GTID:1488306548491684

Subject:Computer Science and Technology

Abstract/Summary:

Supercomputers keep growing in scale and architectural complexity, which poses great challenges for reliability and resilience. Because faults in supercomputer systems are generally transient, diverse, and uncertain, fault prediction has become one of the most important parts of the problem. This places more demanding requirements on fault information collection, fault prediction, and fault-tolerance technology. By combining efficient data collection with accurate analysis, data preprocessing provides strong support for fault-tolerance techniques in computing systems. The data generated by exascale computing will grow from the TB to the PB scale, which requires higher aggregate bandwidth to reduce latency in large-scale data collection, while real-time data collection can easily generate a large number of bursty I/O requests; both are major bottlenecks for system I/O performance. At the same time, degraded I/O performance also reduces the execution efficiency of fault tolerance in supercomputing systems. To keep large-scale applications running efficiently and to improve storage utilization for I/O-intensive applications, this dissertation studies the reliability and I/O problems of supercomputer systems, with in-depth research and experimental analysis of fault data preprocessing techniques, fault-tolerance techniques, and I/O problems. The main results are as follows:

Fault data preprocessing technology is designed and optimized for supercomputer systems. Firstly, for complex application environments where bursty I/O can occur, a real-time data collection framework for supercomputer systems is proposed, composed of a data collector, H2FS, and a distributed data collection manager. It addresses growing system scale and low data collection efficiency by adding the efficient H2FS to provide high performance and availability support for the entire collection framework. Secondly, to address incomplete collection of information about running applications, the performance analysis tools used to collect and analyze typical application performance are optimized, which enriches the types of data collected in the real-time data collection framework. Thirdly, to improve the accuracy and timeliness of system fault analysis and diagnosis, an online log template extraction method based on offline preprocessing is proposed. It consists of two parts: an offline log template extraction process on the Tianhe-1A supercomputer platform, built on an analysis of existing offline log template techniques, and a real-time fault data collection framework designed to analyze logs quickly and incrementally in the middle layer of the storage system, working in conjunction with the real-time data collection module. Finally, experimental results verify the higher performance and better scalability of the framework, as well as the accuracy of the online log template extraction method based on offline preprocessing.

A fault-tolerance technique based on multi-dimensional XOR checkpoint/recovery is proposed to address the likelihood of system failures at runtime and the number of failed nodes involved in large-scale applications. Frequent system failures can make a task's time on the supercomputer platform longer than the execution time the task itself requires, while traditional checkpoint/recovery techniques often struggle to balance recovery time against storage capacity. To solve these problems, we propose a checkpoint/recovery fault-tolerance method based on multi-dimensional XOR and analyze a fault-tolerant framework based on a mathematical function library. Multi-dimensional XOR checkpoint/recovery can handle fault-tolerant operation of large-scale parallel applications, greatly improving system reliability without excessively increasing storage capacity, as verified by experiments.

A storage workload management model (SWMM) for supercomputing systems is proposed to mitigate the impact of large numbers of bursty I/O requests on system performance and fault-tolerance efficiency. It optimizes I/O paths when multiple data-intensive applications access file systems in parallel, thereby improving bandwidth efficiency. At the same time, the capacity balancing strategy for supercomputer storage systems is optimized to solve the capacity imbalance problem in storage expansion. These techniques further improve the efficiency of application execution while alleviating the impact of I/O performance on fault tolerance. We implemented our solution and tested it on the Tianhe-1A (TH-1A) supercomputer. The experimental results show that the I/O path optimization and capacity balancing strategy achieved the desired effect, and that the data collection module had low overhead and high transmission efficiency when transmitting small data blocks.
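
The multi-dimensional XOR checkpoint/recovery idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe the actual scheme, so everything below is an assumption made for illustration: nodes are laid out in a two-dimensional grid, each node holds one equal-length checkpoint block, and one XOR parity block is kept per row and per column. The names `xor_blocks`, `build_parity`, and `recover` are hypothetical and are not taken from the dissertation.

```python
from functools import reduce
from typing import List, Optional, Sequence, Tuple

Grid = List[List[Optional[bytes]]]  # None marks a checkpoint lost with its node


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_parity(grid: Sequence[Sequence[bytes]]) -> Tuple[List[bytes], List[bytes]]:
    """Compute one parity block per row and one per column (assumed 2D layout)."""
    row_parity = [xor_blocks(row) for row in grid]
    col_parity = [xor_blocks(col) for col in zip(*grid)]
    return row_parity, col_parity


def recover(grid: Grid, row_parity: List[bytes], col_parity: List[bytes]) -> Grid:
    """Repair every row or column that has exactly one missing block, repeatedly."""
    grid = [list(row) for row in grid]  # work on a copy
    progress = True
    while progress:
        progress = False
        for r, row in enumerate(grid):
            missing = [c for c, block in enumerate(row) if block is None]
            if len(missing) == 1:
                survivors = [block for block in row if block is not None]
                row[missing[0]] = xor_blocks(survivors + [row_parity[r]])
                progress = True
        for c in range(len(grid[0])):
            column = [grid[r][c] for r in range(len(grid))]
            missing = [r for r, block in enumerate(column) if block is None]
            if len(missing) == 1:
                survivors = [block for block in column if block is not None]
                grid[missing[0]][c] = xor_blocks(survivors + [col_parity[c]])
                progress = True
    return grid


# Hypothetical 3x3 node grid with 4-byte checkpoints.
original = [
    [b"n00-", b"n01-", b"n02-"],
    [b"n10-", b"n11-", b"n12-"],
    [b"n20-", b"n21-", b"n22-"],
]
row_parity, col_parity = build_parity(original)

# Two nodes in the same row fail: row parity alone cannot repair this,
# but each failed node is the only loss in its column.
damaged = [list(row) for row in original]
damaged[0][0] = None
damaged[0][1] = None
restored = recover(damaged, row_parity, col_parity)
assert restored == original
```

In this sketch an R x C grid stores R + C parity blocks instead of a full second copy of all R x C checkpoints, which is one way to read the abstract's claim of improved reliability without a large increase in storage. Some failure patterns still defeat two dimensions (for example four failed nodes at the corners of a rectangle); adding further dimensions would reduce such cases at the cost of more parity.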

Keywords/Search Tags:

Supercomputer Systems, Fault Data Preprocessing, Real-time Data Collection Framework, Log Template Extraction, Multi-Dimensional XOR Checkpoint/Recovery Fault Tolerance, I/O Performance, Storage Workload Management Model

Related items

1	Distributed File System Level Fault-tolerant Mechanism
2	A Checkpoint-Based Fault-Tolerant Service In Distributed Systems
3	Robust integration of multi-level fault detection mechanisms and recovery mechanisms in a component-based support middleware model for fault-tolerant real-time distributed computing
4	Study On Backward Recovery Of Fault Tolerant Technology In Distributed Systems
5	The Research Of Real-time Fault-Tolerant Mechanism In Distributed Real-time System DRTAS
6	Fault-Tolerant Of MPI Programs Based On Rollback Recovery
7	Research On Failure Analysis,Modeling And Prediction For Supercomputers
8	Research On Key Technology Of Fault-Tolerant Nanoscale Circuit Based On Statistical Model
9	Fault-Tolerant Task Scheduling Algorithms For Real-Time Systems Based On ICM Model
10	Research On The Task Fault-tolerant Scheduling Optimization Algorithms For The Distributed Real-Time System