Optimizing Load Balance In Distributed Stream Processing System

Posted on:2018-10-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J H Fang

Full Text:PDF

GTID:1318330512485360

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

As people pay more attention to the potential value in big data,real-time data analysis is playing an increasingly important role in data analytic applications.Some typical examples for stream-based data sources,like 3G/4G cellular data,network monitoring data,sensor data,etc.,are extremely large.For example,averaging 100,000 tweets are tweeted on Twitter every minute;the generation of Shanghai telecom user data is up to 2 million records per minute;the high-speed acquisition rate of Chinese currently largest solar optical telescope achieves 1GB per second in the chromosphere channel;the urban road vehicle monitoring system in Taizhou collects the vehicle license plate numbers and trajectories at the speed of 240 million records per second.In general,this type of data has some common features,including durative,randomness,variability,unpredictability and so on.Meanwhile,as streaming data has life circle and its value decreases as time goes,traditional database techniques can no longer satisfy the stream data processing due to its requirements on massive dynamic data storage and real-time service performance,which promotes the development of Distributed Stream Processing Engine(DSPE).Usually,the stream-oriented computation frameworks are deployed on large-scale clusters or cloud platforms.The computation tasks are sent to distributed machines in the form of topology,and the pipeline model makes the output of the preceding step becomes the input of the next step.The current research work mainly focuses on processing optimization,efficiency improvement and real-time processing on system architecture layer,including distributed file storage,distributed topology definition,in-memory database techniques and so on.Although some research outcome has been integrated into real-time prototype systems and implemented in commercial applications,distributed stream processing is still facing three main challenges as following:1.Lack of adaptive algorithms with high throughput and low latency.Data skew is a com-mon phenomenon in applications,and the distribution of streaming data is changing dynam-ically and abruptly.For instance,the communication data varies between normal hours and rush hours,during the occurrence of specific events and sales promotion of e-commerce.In such cases,it is difficult to maintain a high throughput and stable low latency at all time.2.Lack of elastic system scalability.To deal with the ever-increasing growth of data size,enterprises need to make improvements on data reduction(scale-down),hardware upgrade(scale-up)and system expansion(scale-out).However,the speed of hardware development has lagged far behind the growth of data size.Therefore non-blocking agile scale-out of system is one of the key technology to ensure the availability of real-time processing system.3.The conundrum of usability assurance strategy.System malfunction in practice often lies in the failures of several nodes.For example,according to a statistic from Google,for an application running on 2,000 machines,the average number of machine failures per day is greater than 10;another statistic from Synergy distributed stream processing system indicates that the machine failure rate is up to 15%;in the deployed IBM System S stream processing system,the system log records 69 significant failure incidents during one month period.The failures of computing nodes in a cluster lead to computational imperfections.As the real-time processing highly calls for effective recovery,it becomes one of the difficult problems in stream systems.Compared to static batch data processing,stream processing requires flexible processing scheme,low-latency performance and high-efficient fault tolerance when handling real-time and unpredictable data.As Michael Stonebraker,the Turing Award winner in 2014,says,the require-ments of a good stream-processing system are fast processing of input data,low-latency generating of output results,scalability of parallel computing,adaptivity of computing resource to application demands,security and availability of data and so on.Therefore,we explore the problem of skewed load distribution which constrains the system performance,the influence of parallel processing scheme on join operations and fault-tolerance for resource availability in stream processing.We aim to develop a high-performance and high-availability platform with the best use of hardware resource in cluster.The main contributions of this thesis are summarized as following:1.We propose an efficient key-based load balancing strategy.In stream topology,data is usually routed and distributed at the granularity of key.For key-based operations,taking key as the granularity for load balancing can maintain the operational semantics to the utmost.However,adjusting the load to a balanced threshold is a bin packing NP-hard problem,which is not efficient due to the large adjustment granularity.In this thesis,we design a light-weight balancing adjustment strategy.In addition,to handle severe skewed data distribution,we make further improvement on load balancing based on the idea of "Split keys on demand,Merge keys as far as possible".The proposed method can decrease the extra cost of balanced processing while achieving excellent performance.2.We decrease the resource utilization for join operation by designing a new node organiz-ing mechanism.Data are grouped in some certain rules according to the predicate semantics of join operation.However,traditional key-based routing policies inevitably incurs massive data broadcasting,especially in non-equijoin processing,which puts great pressure on both network and memory.In this thesis,we adopt the matrix model and quickly generate optimal matrix scheme and the corresponding migration plan,which can reduce the resource usages under data dynamics.Besides,we explore the irregular matrix model,further reducing the resource usages while guaranteeing the correctness.3.We explore a fault-tolerance strategy for economical resource utilization on the premise of efficient data recovery.The fault-tolerance mechanism is the basic insurance system run-ning.Fault-tolerance is usually implemented through backup.Different applications require different latency on fault recovery.The challenges are to cope with conflicts between fault-tolerance accuracy and recovery latency.In this thesis load balancing and fault-tolerance are considered at the same time,and we provides low recovery latency with efficient balance processing.In summary,in this thesis we analyze the load balancing problem in distributed stream pro-cessing systems,and explore a set of high-performance real-time processing scheme based on key-based balancing strategy,join-matrix model and fault tolerance mechanisms.We also provide the theoretical support for our techniques.Through a rich set of experiments and comparisons with the other state-of-the-art techniques using both standard benchmarks and real data sets,we comprehensively verify the correctness and effectiveness of our proposed methods.

Keywords/Search Tags:

Distributed Stream Processing, Load Balancing, Matrix Model, Non-EquiJoin, Fault Tolerance

PDF Full Text Request

Related items

1	Fault Tolerance For Distributed Parallel Stream Processing Systems
2	Research On Load Balancing And Fault Tolerant Mechanism In Big Data Stream Processing System
3	Research And Implement Of Load Balancing Model Based On Active Replication Fault-Tolerance
4	Research On Fault Tolerance In Distributed Stream Data Processing
5	Runtime systems for load balancing and fault tolerance on distributed systems
6	Load Balancing Of Distributed Data Stream Processing System Research And Implementation Of Technology
7	The Operator Scheduling Strategy And Load Balancing Algorithms In Distributed Data Stream Processing
8	Design And Implementation Of Digital Organisms Traffic Scheduling System Load Balancing And Fault Tolerance Mechanisms
9	Research On Job Runtime Characteristics Based Performance Optimization In Big Data Processing System
10	Load Management Policies Of The Distributed Stream Processing System ARTs-SH