
Research On Task Scheduling Optimization Strategies For Apache Heron

Posted on: 2021-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Zhang
Full Text: PDF
GTID: 2518306128974409
Subject: Software engineering

Abstract/Summary:
With the continuous development of emerging information technologies such as cloud computing, the Internet of Things (IoT), Artificial Intelligence (AI), and 5G, traditional batch computing cannot meet the need to process and analyze massive real-time data. Distributed Stream Processing Systems (DSPSs) support this kind of processing through the streaming computing model; they are widely used across many fields and have become an important part of the big data processing ecosystem. However, state-of-the-art DSPSs use static task scheduling mechanisms and algorithms by default, so tasks cannot be scheduled dynamically in practical applications. The lack of dynamic scheduling algorithms hurts DSPSs in terms of performance, load balancing, and resource utilization, and limits their application scenarios.

To address this problem, this dissertation takes Apache Heron, a promising open-source DSPS, as its research subject and proposes two dynamic task scheduling algorithms that are based on different heuristics and take the structure of topologies into account. Both are designed to reduce the overall communication overhead of the system, thereby improving data transmission efficiency and balancing the workload of the cluster. The difficulties and challenges of Heron's default scheduling mechanism in practice are analyzed through experiments and observations, to establish the feasibility of optimizing performance by reducing the overall communication overhead of the system. Following this research direction, the default scheduling algorithm of Heron is abstracted into a basic task scheduling model that provides a theoretical basis for the subsequent parts. Further, the task scheduling optimization problem in Heron is formalized as an NP-hard problem based on the resource constraint model and the optimal communication overhead model.

To solve the problem while satisfying the real-time requirement of the system, a task scheduling strategy based on data-stream classification for Heron (DSC-Heron) is proposed first. This strategy uses the data-stream classification algorithm to classify data streams by their traffic at runtime, and then reschedules the different types of data streams using the data-stream classification scheduling algorithm. During rescheduling, the data streams with high traffic are identified and aggregated onto the same worker node using the data-stream cluster assignment algorithm. In this way the strategy converts inter-node communication into intra-node communication, reducing the overall communication overhead of the system while meeting the resource constraints.

Next, based on another heuristic that depends on the data-stream conversion value, a load-aware task scheduling algorithm for Heron (L-Heron) is proposed. The algorithm is based on the load-aware model and always greedily selects the task with the largest data-stream conversion value for assignment, so that it maximizes the conversion of inter-node traffic into intra-node traffic and reduces the overall communication overhead of the system. In addition, L-Heron uses the load balancing model as a compact resource constraint for task allocation, so it can deliver a performance improvement while balancing the workload of the cluster.

Finally, a dynamic scheduling mechanism for Heron based on the MAPE (Monitor, Analyze, Plan, and Execute) model is constructed. It provides an asynchronous, adaptive dynamic scheduling process and a configuration-based mode for deploying scheduling algorithms.

To evaluate the effectiveness and applicability of DSC-Heron and L-Heron, the classic example topologies of Heron, a custom topology, and an open-source streaming benchmark are used in experiments that measure system completion latency, inter-node traffic, average throughput, and load balancing. Extensive experimental results show that the two proposed dynamic scheduling algorithms improve the performance of Heron compared with the default scheduling algorithm. DSC-Heron is better suited to streaming applications with large data skew and can balance the workload of these topologies to a certain extent. L-Heron is applicable to a wide range of streaming applications: it not only makes up for the shortcomings of DSC-Heron, but also achieves significant load balancing on topologies with large data skew.
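
The core idea shared by both strategies, converting inter-node traffic into intra-node traffic under resource constraints, can be illustrated with a small sketch. The code below is not the DSC-Heron or L-Heron algorithm from the dissertation; it is a generic greedy heuristic written only for illustration. All names (`greedy_colocate`, `inter_node_traffic`, the task and node labels) and the single-number resource model are assumptions, and it does not use any Heron API.

```python
def greedy_colocate(traffic, demand, capacity, nodes):
    """Illustrative greedy placement (hypothetical, not the thesis algorithm).

    traffic:  {(task_a, task_b): observed tuples per second on that stream}
    demand:   {task: resource units the task needs}
    capacity: resource units available on every node (assumed uniform)
    nodes:    list of node names
    Returns {task: node}.
    """
    placement = {}
    free = {n: capacity for n in nodes}

    def place(task, preferred=None):
        # Prefer the peer's node when it still has room; otherwise fall
        # back to the node with the most free capacity.
        if preferred is not None and free[preferred] >= demand[task]:
            node = preferred
        else:
            node = max(nodes, key=lambda n: free[n])
            if free[node] < demand[task]:
                raise ValueError(f"no node can host task {task!r}")
        placement[task] = node
        free[node] -= demand[task]

    # Visit streams from heaviest to lightest so that high-traffic
    # streams get the first chance to become intra-node.
    for (a, b), _rate in sorted(traffic.items(), key=lambda kv: kv[1], reverse=True):
        for task, peer in ((a, b), (b, a)):
            if task not in placement:
                place(task, placement.get(peer))

    # Tasks that appear in no stream still need a node.
    for task in demand:
        if task not in placement:
            place(task)
    return placement


def inter_node_traffic(traffic, placement):
    """Sum the traffic of streams whose endpoints sit on different nodes."""
    return sum(rate for (a, b), rate in traffic.items()
               if placement[a] != placement[b])


# A four-task pipeline with two heavy streams and one light stream.
traffic = {("spout", "bolt1"): 100, ("bolt1", "bolt2"): 10, ("bolt2", "sink"): 80}
demand = {"spout": 1, "bolt1": 1, "bolt2": 1, "sink": 1}
placement = greedy_colocate(traffic, demand, capacity=2, nodes=["n1", "n2"])
print(placement, inter_node_traffic(traffic, placement))
```

In this toy example the two heavy streams end up inside a node and only the light stream crosses nodes, whereas an alternating (round-robin style) placement would send all three streams across the network. A real scheduler would also have to handle multi-dimensional resources, runtime traffic monitoring, and load balancing, which this sketch omits.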
Keywords/Search Tags: Big data, Stream processing, Apache Heron, Task scheduling, Load balancing