Font Size: a A A

Optimizing Load Balance In Distributed Stream Processing System

Posted on:2018-10-18Degree:DoctorType:Dissertation
Country:ChinaCandidate:J H FangFull Text:PDF
GTID:1318330512485360Subject:Software engineering
Abstract/Summary:PDF Full Text Request
As people pay more attention to the potential value in big data,real-time data analysis is playing an increasingly important role in data analytic applications.Some typical examples for stream-based data sources,like 3G/4G cellular data,network monitoring data,sensor data,etc.,are extremely large.For example,averaging 100,000 tweets are tweeted on Twitter every minute;the generation of Shanghai telecom user data is up to 2 million records per minute;the high-speed acquisition rate of Chinese currently largest solar optical telescope achieves 1GB per second in the chromosphere channel;the urban road vehicle monitoring system in Taizhou collects the vehicle license plate numbers and trajectories at the speed of 240 million records per second.In general,this type of data has some common features,including durative,randomness,variability,unpredictability and so on.Meanwhile,as streaming data has life circle and its value decreases as time goes,traditional database techniques can no longer satisfy the stream data processing due to its requirements on massive dynamic data storage and real-time service performance,which promotes the development of Distributed Stream Processing Engine(DSPE).Usually,the stream-oriented computation frameworks are deployed on large-scale clusters or cloud platforms.The computation tasks are sent to distributed machines in the form of topology,and the pipeline model makes the output of the preceding step becomes the input of the next step.The current research work mainly focuses on processing optimization,efficiency improvement and real-time processing on system architecture layer,including distributed file storage,distributed topology definition,in-memory database techniques and so on.Although some research outcome has been integrated into real-time prototype systems and implemented in commercial applications,distributed stream processing is still facing three main challenges as following:1.Lack of adaptive algorithms with high throughput and low latency.Data skew is a com-mon phenomenon in applications,and the distribution of streaming data is changing dynam-ically and abruptly.For instance,the communication data varies between normal hours and rush hours,during the occurrence of specific events and sales promotion of e-commerce.In such cases,it is difficult to maintain a high throughput and stable low latency at all time.2.Lack of elastic system scalability.To deal with the ever-increasing growth of data size,enterprises need to make improvements on data reduction(scale-down),hardware upgrade(scale-up)and system expansion(scale-out).However,the speed of hardware development has lagged far behind the growth of data size.Therefore non-blocking agile scale-out of system is one of the key technology to ensure the availability of real-time processing system.3.The conundrum of usability assurance strategy.System malfunction in practice often lies in the failures of several nodes.For example,according to a statistic from Google,for an application running on 2,000 machines,the average number of machine failures per day is greater than 10;another statistic from Synergy distributed stream processing system indicates that the machine failure rate is up to 15%;in the deployed IBM System S stream processing system,the system log records 69 significant failure incidents during one month period.The failures of computing nodes in a cluster lead to computational imperfections.As the real-time processing highly calls for effective recovery,it becomes one of the difficult problems in stream systems.Compared to static batch data processing,stream processing requires flexible processing scheme,low-latency performance and high-efficient fault tolerance when handling real-time and unpredictable data.As Michael Stonebraker,the Turing Award winner in 2014,says,the require-ments of a good stream-processing system are fast processing of input data,low-latency generating of output results,scalability of parallel computing,adaptivity of computing resource to application demands,security and availability of data and so on.Therefore,we explore the problem of skewed load distribution which constrains the system performance,the influence of parallel processing scheme on join operations and fault-tolerance for resource availability in stream processing.We aim to develop a high-performance and high-availability platform with the best use of hardware resource in cluster.The main contributions of this thesis are summarized as following:1.We propose an efficient key-based load balancing strategy.In stream topology,data is usually routed and distributed at the granularity of key.For key-based operations,taking key as the granularity for load balancing can maintain the operational semantics to the utmost.However,adjusting the load to a balanced threshold is a bin packing NP-hard problem,which is not efficient due to the large adjustment granularity.In this thesis,we design a light-weight balancing adjustment strategy.In addition,to handle severe skewed data distribution,we make further improvement on load balancing based on the idea of "Split keys on demand,Merge keys as far as possible".The proposed method can decrease the extra cost of balanced processing while achieving excellent performance.2.We decrease the resource utilization for join operation by designing a new node organiz-ing mechanism.Data are grouped in some certain rules according to the predicate semantics of join operation.However,traditional key-based routing policies inevitably incurs massive data broadcasting,especially in non-equijoin processing,which puts great pressure on both network and memory.In this thesis,we adopt the matrix model and quickly generate optimal matrix scheme and the corresponding migration plan,which can reduce the resource usages under data dynamics.Besides,we explore the irregular matrix model,further reducing the resource usages while guaranteeing the correctness.3.We explore a fault-tolerance strategy for economical resource utilization on the premise of efficient data recovery.The fault-tolerance mechanism is the basic insurance system run-ning.Fault-tolerance is usually implemented through backup.Different applications require different latency on fault recovery.The challenges are to cope with conflicts between fault-tolerance accuracy and recovery latency.In this thesis load balancing and fault-tolerance are considered at the same time,and we provides low recovery latency with efficient balance processing.In summary,in this thesis we analyze the load balancing problem in distributed stream pro-cessing systems,and explore a set of high-performance real-time processing scheme based on key-based balancing strategy,join-matrix model and fault tolerance mechanisms.We also provide the theoretical support for our techniques.Through a rich set of experiments and comparisons with the other state-of-the-art techniques using both standard benchmarks and real data sets,we comprehensively verify the correctness and effectiveness of our proposed methods.
Keywords/Search Tags:Distributed Stream Processing, Load Balancing, Matrix Model, Non-EquiJoin, Fault Tolerance
PDF Full Text Request
Related items