Optimization And Research Of Hadoop Scheduling Algorithm In Hadoop Heterogeneous Environment

Posted on: 2021-03-31 | Degree: Master | Type: Thesis
Country: China | Candidate: H D Song | Full Text: PDF
GTID: 2428330605955969 | Subject: Engineering
Abstract/Summary:
With the advent of the big data era, more and more people recognize the importance of data, and many enterprises and research institutions now build their work around data analysis and processing. Traditional data processing methods can no longer cope with today's massive data volumes. As an efficient distributed computing platform, Hadoop has gradually become the tool of choice for processing large-scale data sets. Data processing speed is the most important property of a Hadoop cluster, and the scheduling algorithm is one of the main factors that affect it, so improving and optimizing the Hadoop scheduling algorithm is currently a popular research direction. As data volumes keep growing, the demands on the cluster's computing performance rise as well, and they can only be met by continually adding new machines. A cluster built in this way from differently configured machines is called a heterogeneous cluster. Hadoop's default scheduling algorithm assumes a homogeneous cluster in which every node has the same performance. In a heterogeneous cluster, however, node performance differs, and the default algorithm may leave high-performance nodes idle while low-performance nodes are busy. This wastes cluster resources and indirectly lengthens job execution time.

To address these problems, this thesis proposes a dynamic task scheduling algorithm based on node classification in a heterogeneous environment. To make better use of the high-performance nodes in the cluster, the algorithm divides the nodes into performance levels according to the hardware parameters of CPU, memory, and disk, and uses the classification result as one of the scheduling factors. Hadoop's default scheduling algorithm selects nodes by giving priority to data locality, with a certain degree of randomness. To let high-performance nodes take on more computing tasks without reducing the proportion of data-local tasks, the thesis adopts a dynamic data block adjustment strategy based on frequency of use: it adjusts the number of data blocks according to how often each block is used within a given period, and it places data blocks in order from high-performance nodes to low-performance nodes. During task scheduling, the real-time load value of each node is collected through the heartbeat mechanism, and three factors (node load, node performance, and data locality) are considered together to select a suitable node for each task.

The dynamic scheduling algorithm based on node classification was compared experimentally with Hadoop's default scheduling algorithm on a heterogeneous Hadoop cluster built for this purpose. The results show that the proposed algorithm improves the utilization of high-performance nodes while maintaining data locality, shortens the average job execution time, and improves the scalability of heterogeneous Hadoop clusters.
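The abstract does not give the classification formula or the weights used in node selection, so the following Python sketch is only one possible reading of the idea, not the thesis's implementation. The `Node` fields, the weights, the number of levels, and the `max_load` cutoff are all hypothetical values chosen for illustration. It ranks nodes by a weighted hardware score, assigns performance levels, and then picks a node for a task by combining data locality, performance level, and the load reported by heartbeats.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Assumed weights; the thesis abstract does not state any numeric values.
W_CPU, W_MEM, W_DISK = 0.5, 0.3, 0.2          # hardware score weights
W_LOCAL, W_LEVEL, W_IDLE = 0.4, 0.3, 0.3      # node selection weights


@dataclass
class Node:
    name: str
    cpu_cores: int       # hardware parameter: CPU
    memory_gb: float     # hardware parameter: memory
    disk_mbps: float     # hardware parameter: disk throughput
    load: float = 0.0    # real-time load in [0, 1], e.g. from a heartbeat
    level: int = 0       # performance level, set by classify()


def classify(nodes: List[Node], levels: int = 3) -> None:
    """Assign each node a level from 1 (slowest) up to `levels` (fastest)."""
    if not nodes:
        return
    best_cpu = max(n.cpu_cores for n in nodes)
    best_mem = max(n.memory_gb for n in nodes)
    best_disk = max(n.disk_mbps for n in nodes)

    def score(n: Node) -> float:
        # Each parameter is normalized against the best node in the cluster.
        return (W_CPU * n.cpu_cores / best_cpu
                + W_MEM * n.memory_gb / best_mem
                + W_DISK * n.disk_mbps / best_disk)

    ranked = sorted(nodes, key=score)
    for i, n in enumerate(ranked):
        # Split the ranking into `levels` roughly equal bands.
        n.level = 1 + i * levels // len(ranked)


def select_node(nodes: List[Node], block_hosts: Set[str],
                levels: int = 3, max_load: float = 0.8) -> Optional[Node]:
    """Pick a node for a task whose input block is stored on `block_hosts`."""
    # Skip nodes whose reported load is above the (assumed) cutoff.
    candidates = [n for n in nodes if n.load < max_load]
    if not candidates:
        return None

    def priority(n: Node) -> float:
        local = 1.0 if n.name in block_hosts else 0.0
        return (W_LOCAL * local
                + W_LEVEL * n.level / levels
                + W_IDLE * (1.0 - n.load))

    return max(candidates, key=priority)
```

With these assumed weights, a data-local node is preferred over a faster remote one, and among data-local nodes the higher level wins unless it is heavily loaded. A real scheduler would plug comparable logic into Hadoop's own scheduler interfaces rather than a standalone function.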
Keywords/Search Tags: Hadoop, Heterogeneous cluster, Task scheduling, High performance node