Research Of Task Scheduling Strategy For Heterogeneous Cluster In Spark Computing Environment

Posted on: 2019-09-09  Degree: Master  Type: Thesis
Country: China  Candidate: K K Liu  Full Text: PDF
GTID: 2428330545473852  Subject: Computer technology
Abstract/Summary:
With the rapid development of the mobile Internet, the wide deployment of 4G networks, the coming popularization of 5G, and the vigorous promotion of Internet of Things technology, data in every industry is growing exponentially, marking the arrival of the big data era. Mining commercial and practical value from large-scale enterprise data, production data, and technical-experience data has become the current trend in Internet development. As big data and cloud computing technology mature, historical data steadily accumulates, computing power continuously improves, and machine learning, deep learning, and artificial intelligence are applied to every aspect of daily life, production, and business.

As data volumes increase sharply, data-center hardware is repeatedly expanded and upgraded. Over time, the nodes of a Spark cluster diverge: component aging leaves nodes with unequal computing power, the system resources currently available differ from node to node, and communication overheads vary, so cluster heterogeneity becomes increasingly significant. Spark provides two task scheduling strategies, FIFO (first-in-first-out) scheduling and Fair share scheduling, and neither considers cluster heterogeneity or the current resource usage of nodes. As a result, high-performance nodes sit idle while low-performance nodes are overloaded, job execution is inefficient, and the straggler ("short board") effect is severe.

To address these problems, this thesis proposes an algorithm (the NPTC algorithm) that assigns compute nodes to tasks in a heterogeneous cluster environment according to the current performance of each node and the complexity of each Spark task. The main work is as follows.

First, node performance is determined from the system resources currently available on each cluster node, such as CPU utilization, memory utilization, per-core queue length, and network I/O speed. A node monitoring module collects these influence factors, and a node judgment module decides each node's capacity to receive tasks and quantifies it as a weight governing the share of tasks the node receives.

Second, task complexity is determined from the RDD transformation functions applied and the size of the RDD dataset.

Third, a scheduling policy assigns Spark tasks according to node performance: each node's task quota is the ratio of its current performance weight to the sum of the performance weights of all nodes in the cluster, and tasks are distributed in those proportions. By matching quotas to node performance, a heterogeneous cluster can complete its assigned workload in the manner of a homogeneous cluster: each node finishes the tasks it receives in roughly equal time, which maximizes the computing power of the entire cluster and reduces overall job execution time. A sketch of this quota computation is given below.
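To make the quota computation concrete, the following is a minimal sketch in Scala (the language Spark itself is written in). The metrics mirror those named above (CPU utilization, memory utilization, per-core queue length, network I/O), but the NodeMetrics type, the equal weighting of the metrics, the per-operator complexity costs, and the rounding scheme are all illustrative assumptions, not the thesis's actual NPTC implementation.

```scala
// A minimal sketch of the weight-proportional quota idea described above.
// NOT the thesis's NPTC implementation: the metrics mirror the abstract,
// but the coefficients and costs below are illustrative assumptions.

// Snapshot of a node's current resource usage, each normalized to [0, 1].
case class NodeMetrics(
  name: String,
  cpuUtil: Double,   // CPU utilization
  memUtil: Double,   // memory utilization
  queueLen: Double,  // per-core run-queue length (normalized)
  netIoUtil: Double  // network I/O load (normalized)
)

object NptcSketch {

  // Weight a node by its spare capacity: less-loaded nodes score higher.
  // Equal coefficients are assumed here; the thesis derives its own.
  def nodeWeight(m: NodeMetrics): Double = {
    val spare = Seq(1.0 - m.cpuUtil, 1.0 - m.memUtil,
                    1.0 - m.queueLen, 1.0 - m.netIoUtil)
    spare.sum / spare.size
  }

  // Task complexity as sketched in the abstract: a score for the RDD
  // transformation chain scaled by the dataset size. The per-operator
  // costs are invented placeholders.
  val opCost = Map("map" -> 1.0, "filter" -> 0.5,
                   "reduceByKey" -> 3.0, "join" -> 4.0)
  def taskComplexity(transformations: Seq[String], datasetSizeMB: Double): Double =
    transformations.map(op => opCost.getOrElse(op, 1.0)).sum * datasetSizeMB

  // Split totalTasks across nodes in proportion to each node's share of the
  // summed weights, so idler nodes receive larger quotas. (Rounding residue
  // would need redistribution in a real scheduler.)
  def taskQuotas(nodes: Seq[NodeMetrics], totalTasks: Int): Map[String, Int] = {
    val weights = nodes.map(n => n.name -> nodeWeight(n)).toMap
    val total   = weights.values.sum
    nodes.map(n => n.name -> math.round(totalTasks * weights(n.name) / total).toInt).toMap
  }

  def main(args: Array[String]): Unit = {
    val cluster = Seq(
      NodeMetrics("fast-node", cpuUtil = 0.20, memUtil = 0.30, queueLen = 0.10, netIoUtil = 0.15),
      NodeMetrics("slow-node", cpuUtil = 0.80, memUtil = 0.70, queueLen = 0.60, netIoUtil = 0.50)
    )
    taskQuotas(cluster, totalTasks = 100).foreach(println)
  }
}
```

In this example cluster, the lightly loaded node receives about 70 of the 100 tasks and the heavily loaded node about 30, so both should finish their quotas in roughly equal time, which is the stated goal of the scheduling strategy.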
Keywords/Search Tags: Spark, Heterogeneous Cluster, Node Performance Decision, Task Complexity, Task Scheduling