With the explosive growth of data across industries and rising business demands, data storage and computing technologies face new challenges, and big data technologies are developing rapidly. Spark is a distributed big data computing framework based on in-memory computing. It is widely used across industries because of its reliable performance and its rich computing functions, such as machine learning, graph computing, and stream computing, but it also has problems. The hardware that provides the basic services of big data technology is updated, iterated, and expanded over time, so the nodes in a Spark cluster come to have different hardware configurations and the cluster becomes heterogeneous. Because the computing nodes differ in performance, Spark jobs perform differently on different nodes. Analysis of the Spark source code shows that Spark’s resource scheduling strategy assumes a homogeneous cluster and therefore allocates resources unevenly in heterogeneous clusters, which degrades load balance and in turn job execution efficiency. Therefore, building on Spark’s default resource manager, this paper improves the resource scheduling of the cluster by taking the impact of heterogeneous computing nodes into account. Spark’s default resource scheduling strategy considers only the number of remaining CPU cores on each computing node and so takes insufficient account of cluster heterogeneity. To address this, the paper first proposes a method for the comprehensive evaluation of node load. Load evaluation indices for heterogeneous nodes are established by jointly considering the static performance of the nodes and their dynamic load information at runtime. The weights of the evaluation indices are then determined by the cluster analytic hierarchy process (AHP), yielding a quantitative model of the real-time load of each computing node. Next, a feedback mechanism is added to Spark’s original resource manager: the order in which computing nodes are considered during resource allocation is adjusted periodically according to the real-time load quantified by the comprehensive load evaluation method, which realizes a dynamic resource scheduling strategy based on the comprehensive evaluation of node load. Finally, comparative experiments on a deployed Spark platform show that the strategy effectively alleviates load imbalance and improves the execution efficiency of the cluster, and that it has good scalability.
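The pipeline described above (index weights from AHP, a weighted load score per node, and reordering of nodes by that score) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The three indices (CPU utilization, memory utilization, static capacity), the pairwise comparison values, and all names such as `ahp_weights`, `Node`, and `allocation_order` are hypothetical, and the weights are approximated with the row geometric-mean method, which is one common way to derive AHP priorities.

```python
import math
from dataclasses import dataclass


def ahp_weights(pairwise):
    """Approximate AHP priority weights using the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


@dataclass
class Node:
    name: str
    cpu_util: float   # dynamic index: runtime CPU utilization in [0, 1]
    mem_util: float   # dynamic index: runtime memory utilization in [0, 1]
    capacity: float   # static index: performance relative to the strongest node, in (0, 1]


def load_score(node, weights):
    """Weighted load of a node; a higher score means a more heavily loaded node."""
    w_cpu, w_mem, w_cap = weights
    # A weaker node (lower capacity) contributes a larger static load term.
    return (w_cpu * node.cpu_util
            + w_mem * node.mem_util
            + w_cap * (1.0 - node.capacity))


def allocation_order(nodes, weights):
    """Return nodes sorted so that the least-loaded node is offered resources first."""
    return sorted(nodes, key=lambda n: load_score(n, weights))


# Hypothetical pairwise comparison matrix: CPU is judged twice as important
# as memory and four times as important as static capacity.
PAIRWISE = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
WEIGHTS = ahp_weights(PAIRWISE)

# Hypothetical load snapshot of a three-node heterogeneous cluster.
NODES = [
    Node("worker-a", cpu_util=0.9, mem_util=0.8, capacity=1.0),
    Node("worker-b", cpu_util=0.2, mem_util=0.3, capacity=0.5),
    Node("worker-c", cpu_util=0.5, mem_util=0.5, capacity=0.8),
]

if __name__ == "__main__":
    for node in allocation_order(NODES, WEIGHTS):
        print(node.name, round(load_score(node, WEIGHTS), 4))
```

In a feedback loop of the kind the abstract describes, a snapshot like `NODES` would be refreshed periodically from worker reports and the ordering recomputed before each allocation round. How the paper collects the load information and how often it updates the order are not stated here, so those details are left out of the sketch.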