
Zone Division And Dynamic Load Scheduling Algorithm Based On Heterogeneous Spark Cluster

Posted on: 2020-12-26 | Degree: Master | Type: Thesis
Country: China | Candidate: X Zhu | Full Text: PDF
GTID: 2428330596975459 | Subject: Software engineering
Abstract/Summary:
With the development of big data technology, a variety of big data processing frameworks have emerged. Spark is currently the most widely used of these computing frameworks: it supports in-memory computation, provides interactive computing and querying, offers rich data-manipulation operations, and underpins services such as data mining, machine learning, and stream computing. A computer cluster is heterogeneous when its machines have different hardware configurations, so that their performance on Spark jobs differs. The growth of cloud computing and shared data centers has made clusters increasingly heterogeneous, and the rise of machine learning has further added machines with mixed CPU and GPU architectures. Cluster load balancing distributes jobs or tasks across multiple computing units to increase throughput and improve data-processing capacity and availability. Because of resource isolation and resource reuse, the runtime environment of modern software architectures is increasingly complex, and Spark may run under complex and variable load conditions.

An analysis of the Spark source code reveals two problems: Spark's resource allocation strategy, which assumes a homogeneous number of processor cores, cannot adapt to a heterogeneous cluster environment; and Spark's task scheduling lacks a load-based scheduling strategy.

To address these two points, this thesis proposes a zone division and dynamic load scheduling algorithm for heterogeneous Spark clusters. The algorithm consists of two parts: zone-based job scheduling and dynamic-load task scheduling. Zone-based job scheduling comprises dividing the cluster into zones and allocating job resources by zone. A zone is a group of computers in the cluster that have the same number of processor cores and similar benchmark performance. Zone-based resource allocation assigns computing resources to a Spark job from a specified zone, or an adjacent zone, according to the user's configuration. In this way the heterogeneity of the cluster is fully exploited to speed up Spark jobs, and the user can allocate computing resources of different performance levels according to job priority. Dynamic-load task scheduling means that Spark schedules tasks based on load information collected periodically from each node of the cluster. This lets Spark avoid or reduce the use of high-load nodes and schedule more tasks onto low-load nodes, further accelerating Spark jobs.
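The abstract does not include code, so the following is only a minimal Scala sketch of the two ideas under stated assumptions: a hypothetical Node record carrying a core count, a benchmark score, and a recently sampled load; a zone-division step that groups nodes by core count and then by similar benchmark scores; and a load-aware placement step that prefers the least-loaded node. The names (Node, divideZones, pickNode) and the tolerance parameter tol are illustrative assumptions, not the thesis's implementation.

```scala
// Minimal illustrative sketch, NOT the thesis's implementation: Node,
// divideZones, pickNode, and the tolerance `tol` are all hypothetical.
case class Node(host: String, cores: Int, benchScore: Double, load: Double)

object ZoneLoadSketch {
  // Zone division: group nodes by core count, then split each group into
  // zones whose benchmark scores stay within a relative tolerance of the
  // lowest score in the zone ("similar performance in the benchmark").
  def divideZones(nodes: Seq[Node], tol: Double = 0.15): Seq[Seq[Node]] =
    nodes.groupBy(_.cores).values.toSeq.flatMap { sameCores =>
      val sorted = sameCores.sortBy(_.benchScore).toList
      sorted.foldLeft(List.empty[List[Node]]) {
        // Node still within tolerance of the current zone's lowest score:
        case (zone :: rest, n) if n.benchScore <= zone.head.benchScore * (1 + tol) =>
          (zone :+ n) :: rest
        // Otherwise start a new zone (also handles the very first node).
        case (zones, n) => List(n) :: zones
      }
    }

  // Dynamic-load placement: prefer the node with the lowest recently
  // sampled load, so high-load nodes are avoided or used less.
  def pickNode(zone: Seq[Node]): Option[Node] =
    zone.sortBy(_.load).headOption
}
```

A real scheduler would hook such decisions into Spark's internal TaskScheduler and collect per-node load through periodic heartbeats; the sketch above shows only the grouping and ordering logic.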
Keywords/Search Tags: Spark, heterogeneity, load balance