Research On Fast Fault Tolerance Mechanism For Single Point Of Failure In Stream Computing Environment

Posted on:2022-02-22

Degree:Master

Type:Thesis

Country:China

Candidate:Z Yang

GTID:2518306350989639

Subject:Master of Engineering

Abstract/Summary:

In distributed stream computing environments built for real-time data processing, the fault-tolerance mechanism is an essential component of the stream computing system. Most stream computing systems currently adopt a checkpoint-based state rollback strategy, and Flink implements checkpointing through an asynchronous barrier snapshot algorithm. Under this recovery mechanism, however, the fault recovery time grows with the number of tasks in the job, which makes recovery costly and degrades performance.

This dissertation proposes a fast fault-tolerance mechanism, based on snapshots, for single points of failure. In the fault recovery phase, the mechanism uses the latest task snapshots to roll back the state of the restarted tasks, then improves recovery accuracy by using the cached data of upstream tasks, combined with the data cached by the faulty tasks during normal operation, to ensure that the system's processing results remain consistent. Building on a study of the asynchronous barrier snapshot mechanism and Flink's internal data transfer mechanism, the data transferred in the pipeline stream is persisted, and the cache storage is reset with each checkpoint cycle. The single snapshot generated by the native mechanism and the persisted pipeline stream are then used together to complete fast recovery of the failed task.

In addition, queuing theory and a modified topological sorting algorithm are used to build a system delay model. Failure recovery delay models are then built for Flink's native fault-tolerance mechanism and for the fast fault-tolerance mechanism respectively, and a comparison of the models demonstrates the rationality of the fast fault-tolerance mechanism at the theoretical level. Comparison experiments are run for both mechanisms in the same environment. The experiments show that when the number of tasks increases from 4 to 12, the fault recovery time under the native fault-tolerance mechanism increases from 59 to 71 seconds, while that of the fast fault-tolerance mechanism remains unchanged. The difference between the two (native minus fast) increases from 9 to 19 seconds, which verifies the effectiveness of the fast fault-tolerance mechanism.
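The recovery idea described above (roll back only the failed task to its latest snapshot, then replay the records its upstream task cached since that checkpoint) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the dissertation's implementation and not Flink's API: `UpstreamTask`, `SumTask`, and `recover` are hypothetical names, the operator state is a single running sum, and the use of the failed task's own cached data to keep downstream results consistent is left out.

```python
from dataclasses import dataclass, field


@dataclass
class UpstreamTask:
    """Hypothetical upstream task that caches every record sent since the last checkpoint."""
    cache: list = field(default_factory=list)

    def emit(self, record, downstream):
        self.cache.append(record)   # persist the record before handing it downstream
        downstream.process(record)

    def on_checkpoint(self):
        self.cache.clear()          # cache storage is reset with each checkpoint cycle


@dataclass
class SumTask:
    """Hypothetical stateful task: its state is the running sum of all records seen."""
    state: int = 0
    snapshot: int = 0

    def process(self, record):
        self.state += record

    def on_checkpoint(self):
        self.snapshot = self.state  # stand-in for the snapshot taken at the checkpoint


def recover(snapshot, upstream):
    """Restart only the failed task: restore its snapshot, then replay the upstream cache."""
    task = SumTask(state=snapshot, snapshot=snapshot)
    for record in upstream.cache:
        task.process(record)
    return task


if __name__ == "__main__":
    upstream, task = UpstreamTask(), SumTask()
    for r in (1, 2, 3):
        upstream.emit(r, task)
    task.on_checkpoint()
    upstream.on_checkpoint()
    for r in (4, 5):
        upstream.emit(r, task)      # processed after the checkpoint, so absent from the snapshot

    recovered = recover(task.snapshot, upstream)
    print(recovered.state == task.state)
```

In this sketch only the failed task is rebuilt, so the replay work depends on the number of records cached since the last checkpoint rather than on the number of tasks in the job, which is the property the abstract attributes to the fast mechanism.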

Keywords/Search Tags:

Snapshot, Fault-tolerance, Failure Recovery, Checkpoint, Stream Computing

Related items

1	Research Of Task Recovery Strategy Based On Checkpoint In MapReduce
2	Research On Fault Tolerance In Distributed Stream Data Processing
3	The Research And Implementation Of Checkpoint Technology Based On WinNT
4	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
5	Design And Implementation Of Distributed Stream Computing Framework Fault Tolerance
6	Research On Key Technologies Of Fault Tolerance Of Large Scale Distributed Simulation System
7	Design And Implementation Of Application Anomaly Recovery Mechanism In Android System
8	Research On Rollback Recovery Fault-Tolerance Technology In High Availability Cluster
9	Research On Low Overhead Non-blocking Checkpointing Scheme For Mobile Computing System
10	The Research On Low-overhead Rollback Recovery Fault-Tolerance Technology