Dynamic Adaptive Checkpoint Mechanism For Streaming Applications Based On Reinforcement Learning

Posted on:2021-03-02

Degree:Master

Type:Thesis

Country:China

Candidate:X Liu

Full Text:PDF

GTID:2428330614950024

Subject:Cyberspace security

Abstract/Summary:

As big data application scenarios continue to evolve, stream computing has gradually become a mainstream computing model. Streaming applications usually need to run continuously, so they are inevitably affected by various software and hardware failures, and in a distributed environment such failures occur more frequently. Ensuring high reliability of stream processing applications without compromising real-time performance is therefore a research hotspot in the field of stream computing. Among mainstream fault-tolerance methods for stream processing, the checkpoint mechanism, which combines passive backup with upstream backup, is currently more efficient than active backup, which requires substantial backup resources. When checkpoint-based fault tolerance is adopted, selecting an appropriate checkpoint interval is the key to keeping streaming applications running smoothly. Stream processing systems represented by Apache Flink currently support only fixed-interval checkpoints, which makes it difficult to strike a good trade-off between fault-tolerance cost and fault recovery cost in dynamically changing streaming scenarios. This paper first studies the fault-tolerance cost of the barrier-based checkpoint method: it analyzes how checkpointing and fault recovery affect system performance while a streaming application is running, and identifies the main factors that determine checkpoint runtime cost and fault recovery cost. Building on this cost analysis, the paper then uses reinforcement learning to adjust the checkpoint interval dynamically and adaptively in response to changing environmental indicators such as load and failures. This method avoids modeling the overall environment of the streaming application while adaptively optimizing processing delay and failure recovery time at the same time. The algorithm is implemented on top of the original fault-tolerance mechanism of the Flink platform, which removes the limitation that the platform supports only fixed-interval checkpoints. Finally, the algorithm is compared on the Flink platform with existing checkpoint interval optimization models and algorithms. The experimental results show that the proposed dynamic checkpoint interval adjustment algorithm reduces processing delay by 10% and failure recovery time by 37% compared with these existing models and algorithms, and that its optimization performance is relatively stable across different application scenarios.
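The abstract does not specify the reinforcement learning formulation, so the following is only a minimal sketch of the general idea, assuming tabular Q-learning. The state set (`LOAD_STATES`), the candidate intervals (`INTERVALS`), the reward weights, and the class name `IntervalAgent` are all hypothetical and are not taken from the thesis or from the Flink API. The reward penalizes both processing delay and recovery time, mirroring the two quantities the abstract says are optimized together.

```python
import random

# Hypothetical action set: candidate checkpoint intervals in seconds.
INTERVALS = [5, 10, 30, 60]
# Hypothetical coarse-grained load states observed from the running job.
LOAD_STATES = ["low", "medium", "high"]


def reward(latency_ms, recovery_s, w_latency=1.0, w_recovery=1.0):
    """Negative weighted cost: lower delay and recovery time give a higher reward."""
    return -(w_latency * latency_ms + w_recovery * recovery_s)


class IntervalAgent:
    """Tabular Q-learning agent that picks a checkpoint interval per load state."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)
        self.q = {(s, a): 0.0 for s in LOAD_STATES for a in INTERVALS}

    def best_interval(self, state):
        # Greedy choice; ties resolve to the first interval in INTERVALS.
        return max(INTERVALS, key=lambda a: self.q[(state, a)])

    def choose(self, state):
        # Epsilon-greedy: explore occasionally, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(INTERVALS)
        return self.best_interval(state)

    def update(self, state, action, r, next_state):
        # Standard Q-learning temporal-difference update.
        best_next = max(self.q[(next_state, a)] for a in INTERVALS)
        td_target = r + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

In a real deployment, the latency and recovery measurements fed to `reward` would have to come from the job's own metrics, and applying the chosen interval at runtime would require changes to the platform's checkpoint scheduling, which this sketch does not attempt to show.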

Keywords/Search Tags:

stream processing, fault-tolerance, checkpoint interval, reinforcement learning, Flink

Related items

1	Research On Fault Tolerance In Distributed Stream Data Processing
2	Research On Fault-tolerant Strategy Optimization For Flink Stream Processing Framework
3	The Research And Implementation Of Checkpoint Technology Based On WinNT
4	The Design And Research Of Process Level Fault-tolerance Based On Checkpoint
5	Research On Fast Fault Tolerance Mechanism For Single Point Of Failure In Stream Computing Environment
6	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
7	Design And Implementation Of Distributed Stream Computing Framework Fault Tolerance
8	Fault Tolerance For Distributed Parallel Stream Processing Systems
9	Research On Key Technologies Of Fault Tolerance Of Large Scale Distributed Simulation System
10	Research And Implementation Of The Automatic Jobs Fault Tolerant Technology Based On Checkpoint