Research on Fault-Tolerant Strategy Optimization for the Flink Stream Processing Framework

Posted on: 2020-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: X Qing
Full Text: PDF
GTID: 2428330590474464
Subject: Computer Science and Technology
Abstract/Summary:
With the development of big data and Internet of Things technology, a large number of real-time applications have emerged. Such applications require data to be collected, processed, and analyzed in real time, with the results delivered at sub-second latency. Stream computing is a computing paradigm designed for this kind of real-time workload. Stream applications usually run without interruption, so they inevitably encounter faults during operation, especially in large-scale distributed environments. Fault-tolerant recovery has therefore long been a research hotspot in stream computing. Traditional fault-tolerance strategies for streaming applications mainly include active backup, passive backup, upstream backup, and checkpoint-based rollback recovery, each with its own advantages and disadvantages.

Flink, a stream processing framework, implements lightweight asynchronous checkpointing based on the barrier model. However, it still has shortcomings in practice. First, Flink supports only fixed-interval checkpoints. The checkpoint interval is a significant parameter affecting both fault-tolerance overhead and recovery time; if it could be adjusted according to dynamic changes in the stream data, the operating efficiency of the system would improve considerably. Second, Flink provides only a checkpoint-based fault-tolerance mechanism. For stream applications with high reliability requirements, checkpoint-based recovery alone has difficulty meeting the need for fast recovery.

To address these two problems, this thesis proposes two optimization strategies. The first is a checkpoint interval optimization model. Based on an open Jackson queueing network, the thesis proposes a delay model for application processing and a fault recovery model for checkpoints, and derives from these models a method for optimizing the checkpoint interval. The experimental results show that the performance model fits the actual behavior of the Flink system well and can recommend an optimized checkpoint interval according to reliability-related indicators of the system. The second is a partly active backup strategy for critical tasks. Working from the job topology, the thesis uses network connectivity analysis and an improved PageRank algorithm to rank tasks by criticality. On the basis of critical path analysis, the top N critical tasks that fit within the resource constraints are identified and actively backed up, which further improves the reliability of the system. The experimental results demonstrate that the proposed partly active backup method makes full use of the spare resources of the system and ensures fast recovery of critical tasks, thus improving the overall reliability of the application.
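As a rough illustration of ranking tasks by criticality from the job topology, the sketch below runs plain power-iteration PageRank over a small task graph and selects the top N tasks. It is not the improved PageRank variant or the connectivity analysis used in the thesis, whose details the abstract does not give. The topology, the task names, the function names `pagerank` and `top_n_tasks`, and the choice to let rank flow along edges from upstream to downstream tasks are all assumptions made for illustration.

```python
from typing import Dict, List


def pagerank(edges: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """Plain power-iteration PageRank over a directed graph (adjacency lists)."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by tasks with no outgoing edges is spread evenly over all tasks.
        dangling = sum(rank[v] for v in nodes if not edges.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in edges.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def top_n_tasks(edges: Dict[str, List[str]], n: int) -> List[str]:
    """Return the n highest-ranked tasks; ties are broken by task name."""
    rank = pagerank(edges)
    return sorted(rank, key=lambda task: (-rank[task], task))[:n]


# Hypothetical job topology: two sources feed a join, which feeds a sink.
topology = {
    "source_a": ["join"],
    "source_b": ["join"],
    "join": ["sink"],
    "sink": [],
}

# Candidate tasks for active backup under an assumed budget of N = 2.
critical = top_n_tasks(topology, 2)
print(critical)
```

With edges pointing downstream, tasks that aggregate many upstream inputs accumulate the most rank. Whether that matches the thesis's notion of criticality is an open assumption; reversing the edge direction would instead favor tasks on which many downstream tasks depend.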
Keywords/Search Tags: stream computing, checkpoint interval, queueing model, partly active backup, critical tasks, Flink