Font Size: a A A

Research On The Key Technology Of Distributed Data Stream Processing Systems Supporting Windows

Posted on:2017-04-29Degree:MasterType:Thesis
Country:ChinaCandidate:H ChenFull Text:PDF
GTID:2428330569998867Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of computer technology,sensor technology and network technology,more and more fields have emerged in real-time processing of massive and high-speed online data,which is characterized by continuous,unpredictable,fast arriving and time-varying,called data stream.The window technology keeps the most recent data in memory to speed up the processing rate,which becomes the hotspot of real-time data stream processing.Today,distributed computing becomes the mainstream technology of data stream processing,and providing window support in distributed data stream processing systems is a serious challenge.First,it needs to spend a lot of communication overhead in order to ensure the correct window semantics.Second,in order to support user-defined operators,it needs to decouple the window logic with the processing logic,which requires the window technology to be versatile and extensible.Third,to improve the resource utilization under the time-varying arrival rate of data,it requires the window technology to be flexible.At present,the research on the distributed processing of window is mainly oriented to specific operators,which lacks the generality and expansibility.Users need to re-implement window logic when developing window-based operators,which greatly increases the developer burden.In addition,the existing window distributed processing technology only supports a few window types,which cannot meet user needs.To this end,this paper focuses on the general and scalable distributed processing technology over window,and tries to integrate window support in distributed data stream processing systems through deeply studying the distributed processing and general processing technology.To provide the window support in distributed data stream processing systems,it requires the window be versatile and extensible to support user-defined operators.To this end,this paper presents a general framework for distributed sliding window over data stream called GDSW(General Distributed Sliding Window),which provides different processing models for different types of operators.In order to implement real-time processing under different data rate,GDSW can be configured with proper window parallelism.GDSW divides window-based operators into two categories: data-independent operators and data-dependent operators.For the data-independent operators,this paper proposes the Equal Partition and Round Robin sending(EPartition-RR)algorithm,which partitions the window with equal size and sends data in Round Robin strategy.The advantages of EPartition-RR algorithm is that it can ensure the correct window semantics and does not add any additional cost.For data-dependent operators,this paper presents a sliding window index algorithm(SWindowIndex)and the input trigger algorithm(InputTrigger).The SWindowIndex algorithm has the ability to process data in real time,which updates the window state immediately upon receiving a new data.But it requires twice as much communication overhead as centralized window.For a window with a sliding step greater one,the eviction is periodic,it would cost additional communication overhead if discarding the oldest data immediately when receives a new data,which can be handled by InputTrigger algorithm by eliminating the eviction signal.Experimental results show that GDSW can achieve real-time processing under high-speed data stream and large window by increasing the window parallelism,and it can increase 10 times more in throughput compared with centralized processing model.Load balancing is one of the important goals of distributed system,it can not only improve resource utilization,but also effectively improve system throughput and reduce processing latency.Fixed data distribution algorithms cannot adapt to the ever-changing distribution of data stream.Therefore,this paper proposes a dynamic load balancing algorithm based on buffer usage,called BufferUsageDLB(Buffer Usage Dynamic Load Balance).BufferUsageDLB algorithm constantly obtains the input buffer to achieve load balancing detection,and re-allocates the task through collecting,analyzing,re-learning and re-deploying.In addition,BufferUsageDLB algorithm calculates the computing ability of node to achieve more accurate task allocation,and statistics the data distribution to deal with the uneven data using well-know SpaceSaving algorithm.Experimental results show that BufferUsageDLB algorithm can accurately detect distribution changes and maintain good load balancing,and increase throughput by 80% to 150% and reduce processing latency by at least 80% compared with existing algorithms.The dynamic changing of the data rate requires the system to be flexible.When the parallelism of the window is not high enough to deal with the high-speed data stream,it should increase the parallelism.Conversely,when the resource is allocated too much,it should retrieve some resource to improve the resource utilization.To solve this problem,this paper proposes EWindow algorithm,which calculates the appropriate window parallelism according to the data rate and the working capacity of nodes.The experimental results show that EWindow algorithm can adaptively change the window parallelism under different data rate,and can keep the resource utilization more than 70%.In order to verify the theoretical research of this paper,we design and implement a distributed data stream processing system called WindowStorm based on Apache Storm,which uses a hierarchical architecture.The underlying layer is data stream processing engine,which focuses on Apache Storm.In the above,the GDSW framework is realized to achieve distributed processing of window.The top layer is the application layer,which provides the most widely used window-based operators.Users can also inherit and implement the window interface to extend the operators.In addition,WindowStorm provides load balancing and elastic extension modules to cope with dynamic data stream.Dynamic load balancing module uses BufferUsageDLB algorithm,which dynamically adjusts the distribution schema by monitoring the change of data distribution.Elastic extension module uses EWindow algorithm,which automatically adjusts the window parallelism to improve resource utilization.Experimental results show that WindowStorm can effectively reduce the workload of developing window-based operators,and it can keep the resource utilization higher than 70% and control the processing latency within milliseconds under dynamic data stream.
Keywords/Search Tags:Window Technology, Data Stream, Distributed Processing, Dynamic Load Balance, Elastic Scaling
PDF Full Text Request
Related items