Research On Performance Prediction And Tuning Of Hadoop

Posted on: 2020-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Li
GTID: 2428330590496405
Subject: Computer technology
Abstract/Summary:
In recent years, the Hadoop distributed computing platform has been widely used in industry for large-scale data processing. A large computing cluster is both a substantial financial investment and a valuable resource for an enterprise. To improve the utilization of cluster resources and to manage cluster usage effectively, it is particularly important to predict the execution time of the tasks waiting to run. Optimizing cluster performance is another important way to improve utilization. This thesis studies performance prediction and parameter tuning for the Hadoop platform by means of simulation. The work covers the following three aspects.

1. Simulation methods based on the Hadoop operation process. The main components of Hadoop, namely the resource scheduling manager (YARN), the cluster network transmission model, the distributed file system (HDFS), and the MapReduce process, are simulated in detail. The complete execution process of a real cluster is reproduced with an event-driven simulation method in order to predict job running time accurately.

2. Prediction of the running time of MapReduce jobs. MapReduce, the main operation mode of the Hadoop system, is its most complex part and has the greatest impact on cluster performance, so this thesis studies the running-time characteristics of MapReduce. First, the MapReduce process is divided into Map tasks and Reduce tasks. Then the relationship between the execution time of Map or Reduce tasks on a single node and the degree of parallelism is analyzed, and a prediction model is established to predict the running time of the MapReduce process. To verify the accuracy of the simulation-based prediction method, experiments were carried out on three clusters of 35, 50, and 80 nodes, with TeraSort, WordCount, and Hive as the test case types. The experimental results show that the error between the simulated result and the measured value is less than 10%. This result shows that the simulation method can accurately simulate a large-scale data center with good universality, reliability, and scalability.

3. Automatic tuning of Hadoop parameters. This thesis presents an automatic parameter optimization method based on micro-operations. First, the execution of a whole job is broken down into several micro-operations, so that the effect of a change in a parameter value can be analyzed quantitatively. Then, by reconstructing the execution process from the micro-operation model, the relationship between the parameters and the execution time of the whole job is established. Finally, optimized system parameters are obtained by applying search-based optimization algorithms. To verify the validity of this method, experiments were conducted with two types of jobs, TeraSort and WordCount. Compared with the default parameters, the method reduced job execution time by at least 41% and 30%, respectively. The experimental results show that the method can effectively improve the job execution efficiency of Hadoop and shorten job execution time.
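As a rough illustration of how the ideas in the abstract fit together, the following Python sketch combines an event-driven simulation of task slots with a search over one tunable parameter. It is not the thesis's simulator: the function names, the linear contention cost model in `task_time`, and the `contention` value are all assumptions made for this example, and YARN, HDFS, and network behavior are ignored.

```python
import heapq


def task_time(base_seconds, parallelism, contention=0.15):
    """Hypothetical cost model (an assumption, not from the thesis):
    each extra concurrent task on a node slows every task on that
    node by a fixed fraction."""
    return base_seconds * (1.0 + contention * (parallelism - 1))


def simulate_makespan(num_tasks, num_nodes, parallelism, base_seconds):
    """Event-driven simulation of identical tasks on a cluster.

    Each slot is represented by the time at which it becomes free.
    The next task always starts on the earliest-free slot, so the
    heap acts as the event queue of "slot freed" events.
    """
    slots = [0.0] * (num_nodes * parallelism)
    heapq.heapify(slots)
    duration = task_time(base_seconds, parallelism)
    finish = 0.0
    for _ in range(num_tasks):
        start = heapq.heappop(slots)   # next "slot freed" event
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(slots, end)     # schedule the next event for this slot
    return finish


def best_parallelism(num_tasks, num_nodes, base_seconds, candidates):
    """Exhaustive search: pick the candidate setting with the smallest
    simulated job time."""
    return min(
        candidates,
        key=lambda p: simulate_makespan(num_tasks, num_nodes, p, base_seconds),
    )


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, simulate_makespan(4, 2, p, 10.0))
    print("best:", best_parallelism(4, 2, 10.0, [1, 2, 4]))
```

The search here is exhaustive because the candidate list is tiny; the abstract only says that search-based optimization algorithms are applied, so any other search strategy could be substituted for `best_parallelism` without changing the simulator.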
Keywords/Search Tags:Hadoop, simulation, performance prediction, parameter tuning, MapReduce