Dynamic Adaptive Checkpoint Mechanism For Streaming Applications Based On Reinforcement Learning

Posted on: 2021-03-02  Degree: Master  Type: Thesis
Country: China  Candidate: X Liu
GTID: 2428330614950024  Subject: Cyberspace security
Abstract/Summary:
As the needs of big data application scenarios continue to evolve, stream computing has gradually become a mainstream computing model. Streaming applications usually need to run continuously, and during that time they are inevitably affected by various software and hardware failures; in a distributed environment, failures occur even more frequently. Ensuring high reliability for stream processing applications without sacrificing real-time performance is therefore a research hotspot in the field of stream computing. Among mainstream fault-tolerance methods for stream processing, active backup requires substantial backup resources, whereas the checkpoint mechanism, which combines passive backup and upstream backup, is currently the more efficient approach. With checkpoint-based fault tolerance, selecting an appropriate checkpoint interval is the key to keeping a streaming application running smoothly. Stream processing systems such as Apache Flink currently support only fixed-cycle checkpoints, which makes it difficult to strike a good trade-off between fault-tolerance cost and fault-recovery cost in a dynamically changing streaming scenario. This thesis first studies the fault-tolerance cost of the barrier-based checkpoint method: it analyzes how checkpointing and fault recovery affect system performance while a streaming application is running, and identifies the main factors that determine checkpoint runtime cost and fault-recovery cost. Building on this cost analysis, the thesis then uses reinforcement learning to adjust the checkpoint interval dynamically and adaptively in response to changing environmental indicators such as load and failures. Because it avoids modeling the overall environment of the streaming application, this method can adaptively optimize processing delay and failure recovery time at the same time. The algorithm is implemented on top of the original fault-tolerance mechanism of the Flink platform, which removes the limitation that the platform supports only fixed-cycle checkpoints. Finally, the algorithm is compared on the Flink platform with existing checkpoint interval optimization models and algorithms. The experimental results show that the proposed dynamic checkpoint interval adjustment algorithm reduces processing delay by 10% and failure recovery time by 37% relative to those existing models and algorithms, and that its optimization performance is relatively stable across different application scenarios.
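The abstract does not specify the learning algorithm, state features, or reward, so the following is only a minimal illustrative sketch of the general idea of learning a checkpoint interval from observed cost. Everything in it is an assumption for illustration: the candidate intervals, the two failure-rate states, the first-order cost model in `toy_cost`, and the one-step (contextual-bandit style) value update, which is a simplification rather than the method of the thesis. It does not use or model any Flink API.

```python
import random

# Hypothetical discrete checkpoint intervals in seconds (not taken from the thesis).
INTERVALS = [5, 10, 30, 60]
# Two illustrative environment states, keyed by an assumed failure rate (failures per second).
FAILURE_RATES = {"low_failure": 0.0005, "high_failure": 0.05}


def toy_cost(interval_s, failure_rate, checkpoint_cost_s=1.0):
    """Assumed first-order cost: checkpoint overhead plus expected rework after failures."""
    overhead = checkpoint_cost_s / interval_s          # share of time spent checkpointing
    expected_rework = failure_rate * interval_s / 2.0  # on average half an interval is replayed
    return overhead + expected_rework


class IntervalAgent:
    """Epsilon-greedy agent with a one-step value table (contextual-bandit style)."""

    def __init__(self, states, alpha=0.1, epsilon=0.1, seed=0):
        self.q = {s: [0.0] * len(INTERVALS) for s in states}
        self.alpha = alpha
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self, state):
        # Explore with probability epsilon, otherwise pick the best-known interval.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(INTERVALS))
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward):
        # Move the estimate for (state, action) a step toward the observed reward.
        self.q[state][action] += self.alpha * (reward - self.q[state][action])

    def best_interval(self, state):
        values = self.q[state]
        return INTERVALS[values.index(max(values))]


def train(steps=4000, seed=0):
    agent = IntervalAgent(FAILURE_RATES, seed=seed)
    states = list(FAILURE_RATES)
    for step in range(steps):
        state = states[step % len(states)]  # alternate between the two states
        action = agent.choose(state)
        reward = -toy_cost(INTERVALS[action], FAILURE_RATES[state])  # lower cost, higher reward
        agent.update(state, action, reward)
    return agent
```

Under this toy cost model, calling `train()` and then `best_interval("high_failure")` and `best_interval("low_failure")` should favor a short interval when failures are frequent and a long one when they are rare, which is the trade-off between fault-tolerance cost and recovery cost that the abstract describes. A real controller would replace `toy_cost` with measured latency and recovery metrics.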
Keywords/Search Tags:stream processing, fault-tolerance, checkpoint interval, reinforcement learning, Flink