
Research On Fast Fault Tolerance Mechanism For Single Point Of Failure In Stream Computing Environment

Posted on: 2022-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z Yang
GTID: 2518306350989639
Subject: Master of Engineering
Abstract/Summary:
Fault tolerance is an essential component of distributed stream computing systems built for real-time data processing. Most stream computing systems currently adopt a checkpoint-based state rollback strategy, and Flink implements checkpointing with an asynchronous barrier snapshot algorithm. Under this recovery mechanism, however, the fault recovery time grows with the number of tasks in the job, so the recovery cost is high and performance suffers.

This dissertation proposes a snapshot-based fast fault-tolerance mechanism for single points of failure. In the fault recovery phase, the mechanism rolls back the state of the restarted task using that task's latest snapshot. It then improves recovery accuracy by using the data cached by upstream tasks, combined with the data the faulty task cached during normal operation, to ensure consistent processing results. Building on a study of Flink's asynchronous barrier snapshot mechanism and its internal data transfer mechanism, the data transferred in the pipeline stream is persisted, and the cache storage is reset with each checkpoint cycle. The single snapshot generated by the native mechanism and the persisted pipeline stream are then used together to recover the failed task quickly.

Queuing theory and a modified topological sorting algorithm are used to build a system delay model. Failure recovery delay models are then built for Flink's native fault-tolerance mechanism and for the fast fault-tolerance mechanism, and a comparison of the two models demonstrates the rationality of the fast mechanism at the theoretical level.

Comparison experiments run both the native and the fast fault-tolerance mechanisms in the same environment. The experiments show that when the number of tasks increases from 4 to 12, the fault recovery time under the native fault-tolerance mechanism increases from 59 to 71 seconds, while that of the fast fault-tolerance mechanism remains unchanged. The difference between the two (native minus fast) increases from 9 to 19 seconds, which verifies the effectiveness of the fast fault-tolerance mechanism.
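
The recovery idea described above can be illustrated with a toy model. This is a minimal sketch and not the dissertation's implementation or Flink's API: the names `UpstreamTask`, `SumTask`, and `run_with_single_failure` are hypothetical, the "persisted" pipeline data is just an in-memory list, and the cached data of the faulty task itself is left out. It assumes one upstream task that caches every record it sends and clears that cache at each checkpoint, so that a single failed downstream task can be restored from its own latest snapshot and caught up by replaying the upstream cache, without restarting the rest of the job.

```python
from dataclasses import dataclass, field


@dataclass
class UpstreamTask:
    """Hypothetical upstream operator that caches what it sends downstream."""
    sent_cache: list = field(default_factory=list)

    def emit(self, record, downstream):
        self.sent_cache.append(record)  # keep a copy before handing the record off
        downstream.process(record)

    def on_checkpoint(self):
        self.sent_cache.clear()  # cache is reset with each checkpoint cycle

    def replay(self, downstream):
        # Resend everything emitted since the last completed checkpoint.
        for record in self.sent_cache:
            downstream.process(record)


@dataclass
class SumTask:
    """Hypothetical stateful downstream operator that keeps a running sum."""
    state: int = 0

    def process(self, record):
        self.state += record

    def snapshot(self):
        return self.state

    @classmethod
    def restore(cls, snap):
        return cls(state=snap)


def run_with_single_failure():
    up, down = UpstreamTask(), SumTask()

    for record in (1, 2, 3):
        up.emit(record, down)

    snap = down.snapshot()  # checkpoint completes with state 6
    up.on_checkpoint()

    for record in (4, 5):
        up.emit(record, down)
    state_before_failure = down.state

    # Single point of failure: only the downstream task is restarted.
    down = SumTask.restore(snap)
    up.replay(down)

    return state_before_failure, down.state
```

In this sketch the recovered state matches the state before the failure because the replayed records are exactly those emitted after the snapshot; a real system would also need to handle in-flight records, multiple inputs, and duplicate output suppression.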
Keywords/Search Tags:Snapshot, Fault-tolerance, Failure Recovery, Checkpoint, Stream Computing