
Research And Improvement Of Job Scheduling Algorithm Based On Hadoop

Posted on: 2020-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: J T Ruan
Full Text: PDF
GTID: 2438330572999548
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and the arrival of the big data era, cloud computing technology has developed at an unprecedented pace. As a core technology of cloud computing platforms, Hadoop has been widely adopted. A Hadoop platform connects a large number of computers into a cluster over the Internet, and users submit jobs to the cluster through a client to meet their application requirements. The resource scheduler, a core component of the Hadoop platform, uses a specific scheduling algorithm to allocate resources to jobs and execute them, so the scheduling algorithm directly affects the performance of the entire cluster. Research on Hadoop job scheduling algorithms is therefore important.

As a parallel computing framework, Hadoop MapReduce is used by a growing number of applications for distributed data processing. During the execution of a map/reduce program, a large amount of data may be assigned to some reduce nodes, leaving the load across nodes unbalanced. This data skew directly affects the overall completion time of the job, so solving it is also a focus of current research.

This thesis studies cloud computing resource scheduling algorithms and the data skew problem as follows:
(1) The source code and models of Hadoop's three default resource scheduling algorithms are studied and analyzed, and their advantages and disadvantages are summarized. The cloud computing resource scheduling algorithm proposed in this thesis builds on them.
(2) To address data skew under the MapReduce programming model, the thesis considers both an ideal environment, in which network bandwidth, data migration time between nodes, and similar factors are ignored, and a real production environment. It proposes the load-balancing scheduling algorithm MR-LB, based on the ideal environment, and the load-feedback scheduling algorithm MR-LBF.
(3) An improved hybrid optimization algorithm, GA-PSO, is proposed. The algorithm is based on traditional particle swarm optimization and the genetic algorithm. By analyzing the resource scheduling process in cloud computing, the process model is incorporated into the fitness function of the genetic algorithm, suitable operators are chosen according to the actual application scenario, and the two algorithms are combined in a serial hybrid optimization scheme that is applied to the cloud computing resource scheduling strategy.

Finally, experiments are carried out on two platforms: a cluster server built from virtual machines and the CloudSim cloud computing simulation platform. The data skew solution is analyzed and verified first; the experimental results show that the impact of data skew on the system can be reduced to a large extent. The hybrid optimization algorithm model is then compared with Hadoop's built-in scheduling algorithms. Compared with the traditional scheduling algorithms, the proposed resource scheduling algorithm further improves the resource utilization rate and the overall completion time of cluster jobs, indicating that it is feasible and efficient.
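The data skew problem described above can be illustrated with a small sketch. The abstract does not give the details of MR-LB or MR-LBF, so the code below is not the thesis's algorithm. It is a generic greedy heuristic (largest key group first, always onto the least-loaded reducer) that, like the ideal environment in the abstract, ignores network bandwidth and data migration time. The function names `assign_keys_to_reducers` and `skew_ratio` and the sample sizes are hypothetical.

```python
import heapq


def assign_keys_to_reducers(key_sizes, num_reducers):
    """Greedy load balancing (an assumption, not the thesis's MR-LB).

    key_sizes maps each intermediate key to the amount of data it carries.
    Keys are placed largest first, each on the currently least-loaded reducer.
    Returns (assignment, loads): key -> reducer index, and per-reducer totals.
    """
    # Min-heap of (current load, reducer index); ties go to the lower index.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    loads = [0] * num_reducers
    for key, size in sorted(key_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        loads[reducer] = load + size
        heapq.heappush(heap, (loads[reducer], reducer))
    return assignment, loads


def skew_ratio(loads):
    """Ratio of the heaviest reducer load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Hypothetical sizes of six intermediate key groups.
sizes = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 4, "f": 3}
assignment, loads = assign_keys_to_reducers(sizes, 3)
print(loads, round(skew_ratio(loads), 3))
```

With these sample sizes the three reducers end up with loads 12, 11, and 11, so the heaviest reducer carries only about 6% more than the mean. The overall job completion time is governed by the heaviest reducer, which is why reducing this ratio matters.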
Keywords/Search Tags:Cloud Computing, Hadoop, MapReduce, Data skew, Genetic algorithm, Particle swarm optimization, Resource scheduling