Research On Fault Tolerance In Distributed Stream Data Processing

Posted on:2020-02-25

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Zhuang

Full Text:PDF

GTID:1368330602455535

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of large-scale stream data processing and analysis technology, the distributed stream processing system (DSPS) has proven to be an effective way to process and analyze large-scale data streams in real time. Owing to their strong parallel processing capability and scalability, DSPSs are a new class of distributed systems attracting increasing attention. As distributed systems continue to grow in scale, failure rates increase and reliability issues intensify. In industrial environments such as Google and Facebook, data processing clusters have exceeded the 10,000-node level, and at this scale several node failures can easily occur every day. Fault tolerance is therefore crucial for DSPSs.

DSPSs urgently need effective fault tolerance support for three reasons. First, the one-pass processing mode of streaming data easily leads to the permanent loss of valuable information, resulting in irreparable damage. Second, novel elastic DSPSs can seamlessly adapt to changes in stream workload, which introduces new reliability challenges. Third, the load of streaming data varies constantly, and traditional fault-tolerant strategies struggle to adapt, which causes unnecessary overhead and low processing efficiency.

To address these three aspects, the main contributions of this dissertation are as follows:

(1) An asynchronous incremental checkpoint mechanism and an upstream backup trimming mechanism are introduced. Based on these mechanisms, a set of fault-tolerant protocols that coordinate low-overhead backup and fast recovery from failures is presented for DSPSs. A prototype system named SPATE is implemented with the proposed fault-tolerant protocols, and evaluation results show that the approach provides fast recovery with low overhead.

(2) A novel fault-tolerant mechanism for elastic DSPSs is proposed. In particular, a self-adaptive backup unit named the elastic data slice (EDS) is introduced, which can partition and merge data backups as operators auto-scale at runtime. The consistency of recovery is guaranteed by new upstream backup protocols. After a failure, the system restarts from its state after auto-scaling instead of from the last checkpoint, which avoids high recovery latency. Evaluations on SPATE show that the mechanism supports runtime scaling changes with overhead similar to that of existing approaches, while achieving low recovery latency despite auto-scaling.

(3) A novel load-aware DSPS Optimal Checkpoint Interval (DOCI) model is presented and proven to maximize processing efficiency for a given time period. An approach that dynamically adjusts the optimal checkpoint interval (OCI) of an application is also proposed to accommodate real-time workload fluctuations. Simulation experiments have been conducted to verify the effectiveness of the DOCI model and the efficiency of the online OCI adjustment algorithm. Experimental results with a real-world dataset show that the DOCI model improves system efficiency by up to 32% compared with existing fault-tolerant approaches.
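As a rough illustration of the general idea behind upstream backup trimming in contribution (1), the Python sketch below keeps emitted tuples in an upstream buffer, trims that buffer once a downstream checkpoint covers them, and replays the remainder after a simulated failure. It is not the SPATE protocol: the names `UpstreamBuffer`, `SumOperator`, and `demo` are hypothetical, and the checkpoint here is a simple synchronous full snapshot rather than the asynchronous incremental checkpoint described in the abstract.

```python
from collections import deque


class UpstreamBuffer:
    """Hypothetical output buffer kept by an upstream operator."""

    def __init__(self):
        self._buffer = deque()  # (sequence number, value) pairs, oldest first
        self._next_seq = 0

    def emit(self, value):
        """Back up an outgoing tuple and return its sequence number."""
        seq = self._next_seq
        self._buffer.append((seq, value))
        self._next_seq += 1
        return seq

    def trim(self, checkpointed_seq):
        """Discard tuples already covered by a downstream checkpoint."""
        while self._buffer and self._buffer[0][0] <= checkpointed_seq:
            self._buffer.popleft()

    def replay(self):
        """Return the tuples a recovering downstream operator must reprocess."""
        return list(self._buffer)


class SumOperator:
    """Hypothetical downstream operator whose state is a running sum."""

    def __init__(self, total=0, last_seq=-1):
        self.total = total
        self.last_seq = last_seq

    def process(self, seq, value):
        self.total += value
        self.last_seq = seq

    def checkpoint(self):
        """Take a full snapshot of the state (assumption: synchronous)."""
        return {"total": self.total, "last_seq": self.last_seq}

    @classmethod
    def restore(cls, snapshot):
        return cls(snapshot["total"], snapshot["last_seq"])


def demo():
    upstream = UpstreamBuffer()
    downstream = SumOperator()
    snapshot = downstream.checkpoint()

    for value in [1, 2, 3, 4, 5]:
        seq = upstream.emit(value)
        downstream.process(seq, value)
        if seq == 2:
            # Checkpoint after the third tuple, then trim the backup.
            snapshot = downstream.checkpoint()
            upstream.trim(snapshot["last_seq"])

    # Simulated failure: rebuild from the snapshot and replay the backup.
    recovered = SumOperator.restore(snapshot)
    for seq, value in upstream.replay():
        recovered.process(seq, value)
    return recovered.total, downstream.total
```

In this sketch the recovered operator reaches the same sum as the one that never failed, and the upstream buffer only holds tuples newer than the last checkpoint, which is what keeps backup overhead bounded.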

Keywords/Search Tags:

Fault-tolerance, Distributed Stream Processing, Recovery Latency, Upstream Backup, Optimal Checkpoint Interval

Related items

1	The Research And Implementation Of Checkpoint Technology Based On WinNT
2	Research On Fault-tolerant Strategy Optimization For Flink Stream Processing Framework
3	Dynamic Adaptive Checkpoint Mechanism For Streaming Applications Based On Reinforcement Learning
4	Research On Key Technologies Of Fault Tolerance Of Large Scale Distributed Simulation System
5	Research On Fast Fault Tolerance Mechanism For Single Point Of Failure In Stream Computing Environment
6	The Design And Research Of Process Level Fault-tolerance Based On Checkpoint
7	Study On Backward Recovery Of Fault Tolerant Technology In Distributed Systems
8	Fault Tolerance For Distributed Parallel Stream Processing Systems
9	The Strategy Of Proactive-Reactive Intrusion Tolerance Recovery Based On Hierarchical Model
10	Design And Implementation Of Distributed Stream Computing Framework Fault Tolerance