Font Size: a A A

The Operator Scheduling Strategy And Load Balancing Algorithms In Distributed Data Stream Processing

Posted on:2008-02-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:H F DengFull Text:PDF
GTID:1118360272466635Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Stream-processing systems are widely applied in many fields such as financial management, network monitoring, communication data management, web application, sensor network and so on. The distributed stream processing technology has been produced as the development of the computer network and distributed computing technology. Researches on distributed stream-processing systems become the necessary trend in the data stream processing domain since the sources and application of the data stream systems are distributed and the scale of the applications enlarges quickly. The development of the distributed stream-processing systems, which will be applied to such fields of the national economy and the people's livelihood as the military, finance, etc, is barely underway in the world.The choice of an operator scheduling strategy has significant impact on the runtime memory consumption as well as the output latency of the stream systems. We design a scheduling strategy, called Golden Mean (GM), to balance between the memory usage minimization and latency minimization by taking the situation of the future workload, the current memory consumption and QoS requirement of the query into account. In the GM scheduling strategy, the executing order of the operators is decided uniformly according to the Scoring Function. The static parameters in the scoring function can be customized to meet the need of various application scenarios, while the dynamical parameters can be adjusted according to the current system status automatically. In addition, the priority levels of queries can be guaranteed by GM.Since stream based applications usually involve large volumes of data which are known to exhibit bursty behavior and require timely response, the CPU might not be fast enough to process all incoming data items in a timely manner. Load management is the focus of research in both the distributed stream processing systems and centralized stream processing systems. Although researches on the load management in the data stream systems is similar to that of traditional parallel and distributed systems in many aspects, essential differences exsit between them. All the proposed load balancing algorithms can be broadly characterized as static and dynamic. Forcasting the workload is the basis for the static load balancing algorithms and is very important for dynamic load balancing algorithms. We propose a new metric, the weighted time-to-performance ratio, to measure three revised linear time series models: the optimal moving average model, the optimal exponential smoothing model and the revised grey model GM(1,1), because we argue the efficiency of the algorithms should be considered in the stream systems where data requires timely processing. These algorithms perform well according to the weighted time-to-performance ratio in the distributed stream-processing field.We design a novel architecture for the large-scale distributed stream-processing systems. The whole system consists of a group of heterogeneous computer clusters. The whole system can achieve the global load balancing by balancing every cluster which consists of several homogeneous servers. The main goal of every cluster is exchanging the resources for the performance. In the cluster, enough servers are employed to get rid of the occurrence of overload phenomenon, so techniques for load shedding are not necessary in the system. In the meanwhile, the number of active servers is decided by the practical load level and some servers can be put into the sleep mode for the sake of energy conservation when the load is rather low.We propose a static algorithm of high efficency and high performance based on many heuristic algorithms we have studied. Firstly, tasks are assigned averagely to the machines according to a special initialization policy. Then the optimal criterion for exchanging tasks between two machines is proposed and exploited to speed up the improving process towards load balance.We also propose a job-combination based static algorithm for load balacing. Firstly, almost all jobs are organized into the standard job combinations, each of which consists of one to four jobs. Then they are assigned to the machines according to the assignment algorithm for job combinations, which is a special integer partition algorithm.
Keywords/Search Tags:distributed stream-processing system, operator scheduling, load forcasting, task scheduling, system architecture, load balancing
PDF Full Text Request
Related items