
Fault Tolerance For Distributed Parallel Stream Processing Systems

Posted on: 2022-08-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X T Wang
Full Text: PDF
GTID: 1488306482487754
Subject: Software Engineering
Abstract/Summary:
The rapid development of computer and network technologies, together with increasingly diverse ways of generating data, has led to dynamic, high-speed data being produced continuously in a growing number of industry domains. Such data are usually called streaming data, and their value decreases dramatically over time. In recent years, the demand for real-time processing of streaming data has grown explosively. However, these timeliness requirements often exceed the capacity of traditional databases and batch processing technologies, which are at a disadvantage because of their store-then-process pattern. Stream processing technologies have therefore emerged and become one of the key areas of big data management in both academia and industry. A large number of stream processing systems have sprung up to handle large-scale and diverse streaming data. To meet 24×7 low-latency processing requirements and the need for greater processing capacity, stream processing systems are deployed on clusters of hundreds of commodity servers or on cloud platforms. For instance, the largest Blink cluster at Alibaba consists of over 1,500 servers, and StreamScope at Microsoft is deployed on a shared cluster containing 20,000 machines. However, failures are ubiquitous, because the probability of failure increases with cluster scale and running time. Failures usually cause erroneous results or even make the system unavailable, significantly degrading service quality and financial returns. Fault tolerance has therefore become one of the most important building blocks of distributed parallel stream processing systems. A broad spectrum of fault tolerance techniques has been proposed. Among them, checkpoint-based passive replication approaches are widely adopted by mainstream distributed parallel stream processing systems owing to their low complexity and low resource consumption.

We aim to uncover the potential defects of existing checkpointing mechanisms for distributed stream processing systems and to design corresponding solutions, ultimately achieving highly efficient fault tolerance. Specifically, we aim to ensure fast recovery while reducing the overhead incurred by fault tolerance during failure-free runtime. The main contributions of this thesis are as follows.

(1) Designing a benchmarking framework for fault tolerance techniques of distributed parallel stream processing systems: Previous work on streaming benchmarks mainly focuses on overall performance evaluation during failure-free runtime, without considering the impact of fault tolerance. To fill this gap, we first propose a benchmarking framework for fault tolerance techniques of distributed parallel stream processing systems. The workloads and metrics of previous work cannot be applied directly to the evaluation of fault tolerance techniques. Hence, we define specific metrics covering the extra overhead imposed on performance and the recovery efficiency. Moreover, we propose six workloads built from three typical application programs by configuring data-related, application-related and fault-tolerance-related parameters. After benchmarking two widely used stream processing systems, we uncover two significant phenomena, which lay the foundation for the two lines of research that follow.

(2) Proposing an adaptive checkpointing mechanism to reduce the negative impact on performance: In real production environments, developers choose the checkpointing interval according to their experience and application requirements. However, the checkpointing interval has a significant influence on system availability. A shorter checkpointing interval implies more frequent interference with normal processing, leading to higher latency, while a longer checkpointing interval potentially causes more accumulated tuples to be replayed, leading to longer recovery time. Moreover, a static checkpointing interval cannot keep the system from triggering the checkpointing procedure during traffic peaks, which may lead to prolonged performance deterioration. We therefore optimize the model of the checkpointing interval for both uncoordinated and coordinated checkpointing mechanisms, based on operator utilization and topology utilization respectively. In addition, using the idea of time-series segmentation based on hierarchical clustering, we design an adaptive adjustment strategy that alters the checkpointing interval when the stream rate fluctuates.

(3) Introducing a cost-effective load balancing mechanism to improve recovery efficiency: Data skew occurs constantly in real-world scenarios. Current distributed parallel stream processing systems usually adopt key-based routing policies to send tuples to downstream parallel instances. Hence, when streaming data is skewed, load imbalance among downstream parallel instances is inevitable. Apart from degrading performance during failure-free runtime, load imbalance significantly affects recovery efficiency: overloaded instances with larger state recover more slowly and become the stragglers during recovery. We therefore aim to keep the system balanced while reducing the overhead of load adjustment. Specifically, we propose a key-based hybrid data routing policy. In addition, we design three cost-effective load adjustment strategies that take into account the resource overhead of CPU, memory and network. Owing to the resulting balance quality, recovery efficiency is greatly improved.

In summary, we have comprehensively reviewed previous work on fault tolerance techniques and benchmarks for stream processing systems. Based on evaluations using our stream benchmarking framework for fault tolerance, we propose two optimizations to improve the efficiency of fault tolerance: an adaptive checkpointing mechanism that reduces the extra overhead incurred by fault tolerance during failure-free runtime, and a cost-effective load balancing mechanism that improves recovery efficiency. Extensive experiments have been conducted with different workloads and datasets to verify the effectiveness and efficiency of the methods proposed in this thesis.
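
The checkpointing interval trade-off described in contribution (2) can be illustrated with a toy cost model. The sketch below is not the utilization-based model of this thesis; it swaps in the classical first-order approximation usually attributed to Young, and every name in it (`expected_cost`, `young_interval`, `checkpoint_cost_s`, `mtbf_s`) is a hypothetical chosen for illustration. It assumes a fixed cost per checkpoint and that, on average, half an interval of work is replayed after a failure.

```python
import math


def expected_cost(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Toy model (an assumption, not the thesis's model): fraction of time lost.

    The first term is the checkpointing overhead, which grows as the interval
    shrinks. The second term is the expected replay work: failures arrive at
    rate 1/mtbf_s and each one replays half an interval on average.
    """
    overhead = checkpoint_cost_s / interval_s
    replay = (interval_s / 2.0) / mtbf_s
    return overhead + replay


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Interval that minimizes expected_cost: sqrt(2 * cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    cost_s, mtbf_s = 2.0, 3600.0  # illustrative values only
    best = young_interval(cost_s, mtbf_s)
    for interval in (best / 2.0, best, best * 2.0):
        print(f"interval={interval:7.1f}s  lost fraction={expected_cost(interval, cost_s, mtbf_s):.4f}")
```

Under these assumptions, both halving and doubling the interval relative to the minimizer increase the expected loss, which mirrors the tension between runtime overhead and recovery time described above. A static choice like this one still ignores stream-rate fluctuations, which is the gap the adaptive strategy in the thesis targets.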
Keywords/Search Tags: Distributed Stream Processing, Performance Evaluation, Fault Tolerance, Checkpointing, Load Balancing