
Research On Fault Tolerance In Distributed Stream Data Processing

Posted on: 2020-02-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zhuang
GTID: 1368330602455535
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of large-scale stream data processing and analysis technology, the distributed stream processing system (DSPS) has proven to be an effective way to process and analyze large-scale data streams in real time. Owing to their strong parallel processing capability and scalability, DSPSs are a class of distributed systems attracting increasing attention. As distributed systems continue to grow, failure rates increase and reliability issues intensify. In industry, for example at Google and Facebook, data processing clusters have passed the 10,000-node level, and at this scale several node failures per day are common. Fault tolerance is therefore crucial for DSPSs.

DSPSs urgently need better fault tolerance support for three reasons. First, the one-pass processing mode of streaming data easily leads to the permanent loss of valuable information, causing irreparable damage. Second, novel elastic DSPSs can adapt seamlessly to changes in stream workload, which introduces new reliability challenges. Third, the load of streaming data varies constantly, and traditional fault-tolerant strategies adapt poorly to it, which causes unnecessary overhead and low processing efficiency.

To address these three aspects, the main contributions of this work are as follows.

(1) An asynchronous incremental checkpoint mechanism and an upstream backup trimming mechanism are introduced. Based on these mechanisms, a set of fault-tolerant protocols is presented for DSPSs that coordinates low-overhead backup with fast recovery from failures. A prototype system named SPATE is implemented with the proposed protocols, and the evaluation results show that the approach provides fast recovery with low overhead.

(2) A novel fault-tolerant mechanism for elastic DSPSs is proposed. In particular, a self-adaptive backup unit named elastic data slice (EDS) is introduced, which can partition and merge data backups according to operator auto-scaling at runtime. The consistency of recovery is guaranteed by new upstream backup protocols. The mechanism restarts the system from the state after auto-scaling instead of from the last checkpoint, which avoids high recovery latency. Evaluations on SPATE show that the mechanism supports runtime scaling changes with overhead similar to that of existing approaches, while achieving low recovery latency despite auto-scaling.

(3) A novel load-aware DSPS Optimal Checkpoint Interval (DOCI) model is presented and proved to maximize the processing efficiency for a given time period. An approach that dynamically adjusts the optimal checkpoint interval (OCI) of an application is also proposed to accommodate real-time workload fluctuations. Simulation experiments verify the effectiveness of the DOCI model and the efficiency of the online OCI adjustment algorithm. Experimental results on a real-world dataset show that the DOCI model improves system efficiency by up to 32% compared with existing fault-tolerant approaches.
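
To illustrate the general idea behind the upstream backup trimming named in contribution (1), here is a minimal Python sketch. It is not the SPATE implementation, whose protocol details the abstract does not give. The class `UpstreamBuffer` and its methods `emit`, `trim`, and `replay` are hypothetical names chosen for this example. It assumes the textbook form of upstream backup: an upstream operator keeps each emitted tuple until a downstream checkpoint covers it, then discards it.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class UpstreamBuffer:
    """Hypothetical output buffer held by an upstream operator (illustration only)."""
    next_seq: int = 0
    # (sequence number, tuple) pairs not yet covered by a downstream checkpoint
    pending: deque = field(default_factory=deque)

    def emit(self, item):
        """Record an outgoing tuple and return its sequence number."""
        seq = self.next_seq
        self.pending.append((seq, item))
        self.next_seq += 1
        return seq

    def trim(self, checkpointed_seq):
        """Discard every tuple whose sequence number is <= checkpointed_seq.

        Called once the downstream operator reports that its checkpoint
        reflects all tuples up to and including checkpointed_seq.
        """
        while self.pending and self.pending[0][0] <= checkpointed_seq:
            self.pending.popleft()

    def replay(self):
        """Return the tuples a recovering downstream operator must reprocess."""
        return list(self.pending)


buf = UpstreamBuffer()
for item in ["a", "b", "c", "d", "e"]:
    buf.emit(item)

buf.trim(2)             # downstream checkpoint covers sequence numbers 0..2
replayed = buf.replay() # only tuples 3 and 4 remain to be replayed
```

In this sketch the backup size is bounded by the number of tuples emitted since the last acknowledged checkpoint, so more frequent checkpoints shrink both the buffer and the replay work after a failure.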
Keywords/Search Tags:Fault-tolerance, Distributed Stream Processing, Recovery Latency, Upstream Backup, Optimal Checkpoint Interval