
Research Of Hadoop Configuration Tuning And Job Scheduling Based On Performance Evaluation

Posted on: 2021-02-22  Degree: Master  Type: Thesis
Country: China  Candidate: X H Wang  Full Text: PDF
GTID: 2428330602483749  Subject: Computer Science and Technology
Abstract/Summary:
Hadoop is a distributed system framework that is widely used for parallel processing of big data. MapReduce is the programming model of Hadoop, and its performance is significantly affected by configuration parameters. However, the huge parameter space and the interactions among parameters make it impossible to explore all parameter combinations manually. Moreover, actually running a job incurs a large overhead, so we must build models to predict job performance instead of evaluating every configuration by real execution. Job performance is usually expressed as the predicted job execution time.

There are currently two main kinds of job execution time models. The first relies on formula derivation: it constructs a well-defined equation for time prediction using detailed knowledge of the MapReduce workflow. However, since hundreds of parameters affect Hadoop performance, a mathematical equation cannot cover all of them, so some parameters that significantly influence job execution are ignored. In addition, this approach requires the model builder to have thorough knowledge of the job workflow, which demands high expertise. The second method takes all parameters with an important influence on job performance as the input of a prediction model, and learns the relation between parameter configuration and predicted job execution time from a training data set. Most recent studies consider only the parameters and ignore the available resources, which also have a great influence on job performance. Furthermore, recent studies tune parameters only for a single job, whereas a cluster runs multiple jobs, so tuning parameters for every job individually is impractical.

To address these problems, this paper makes the following contributions:

1. We propose a framework of Hadoop configuration tuning and job scheduling based on performance prediction, which tunes configurations and schedules MapReduce jobs to achieve the best job performance. The framework consists of three components: a test running module, a configuration tuning and job scheduling scheme generation module, and a scheme execution module. The test running module obtains the benchmark data for job execution time prediction. The scheme generation module produces the configuration scheme and the job scheduling order for single jobs and multiple jobs, respectively. The scheme execution module carries out the schemes produced by the scheme generation module.

2. For single-job performance optimization, we propose RJHCT to find the optimal parameters. Concretely, we predict single-job execution time with a packing algorithm and use the predicted execution time as the fitness value of a genetic algorithm. A random forest model yields the influence weight of each parameter on job performance, and this weight serves as that parameter's mutation probability in the genetic algorithm. The optimal solution is then obtained by iterative search. The algorithm outputs an optimal parameter configuration that leads to the shortest job execution time.

3. For multi-job performance optimization, we propose an algorithm based on two-stage-coded reinforcement learning and a genetic algorithm to evaluate scheduling sequences and parameter configurations. Concretely, the chromosome is divided into two parts, the scheduling order and the parameter configuration, and the fitness value is divided into two parts, the current value of the chromosome and the prospect value of the scheduling order. Before each crossover and mutation operation, we compute the inheritance prospect value and the mutation prospect value, and from these values we derive the selection probability of each chromosome. The algorithm outputs the optimal scheduling sequence and parameter configuration, which minimizes the execution time of the whole job sequence.

Finally, we evaluate the above models on real data sizes. The experimental results show that the model accuracy is higher than that of traditional methods, and that performance improves over the default configuration. Through this research, the paper finds an optimal parameter configuration that leads to the shortest job execution time; it not only improves cluster performance but also saves time and resource costs.
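To make the single-job tuning idea in contribution 2 concrete, the following is a minimal sketch, not the thesis's implementation: a genetic algorithm whose per-parameter mutation probabilities are scaled by importance weights (standing in for the random-forest weights) and whose fitness is a predicted execution time (standing in for the packing-algorithm predictor). The parameter names, ranges, weights, and the toy predictor are all illustrative assumptions.

```python
import random

# Hypothetical search space: three Hadoop parameters (names and ranges
# are illustrative, not taken from the thesis).
PARAMS = ["mapreduce.task.io.sort.mb", "mapreduce.map.memory.mb", "dfs.blocksize.mb"]
BOUNDS = [(50, 500), (512, 4096), (64, 512)]

def predict_time(cfg):
    # Stand-in for the packing-algorithm predictor: maps a configuration
    # to a predicted job execution time (smaller is better).
    sort_mb, map_mem, block = cfg
    return abs(sort_mb - 200) * 0.5 + abs(map_mem - 2048) * 0.05 + abs(block - 256) * 0.2

# Stand-in for random-forest importance weights; each gene's mutation
# probability is proportional to its influence on performance.
importances = [0.5, 0.3, 0.2]
mut_prob = [w / max(importances) * 0.4 for w in importances]

def random_cfg():
    return [random.randint(lo, hi) for lo, hi in BOUNDS]

def crossover(a, b):
    point = random.randrange(1, len(a))     # single-point crossover
    return a[:point] + b[point:]

def mutate(cfg):
    # Resample each gene with its importance-weighted probability.
    return [random.randint(*BOUNDS[i]) if random.random() < mut_prob[i] else g
            for i, g in enumerate(cfg)]

def tune(pop_size=30, generations=60, seed=1):
    random.seed(seed)
    pop = [random_cfg() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=predict_time)          # shorter predicted time = fitter
        elite = pop[: pop_size // 2]        # keep the better half
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return min(pop, key=predict_time)

best = tune()
print(best, predict_time(best))
```

In the thesis's setting, `predict_time` would be replaced by the trained execution-time model and `mut_prob` by the normalized random-forest weights, so the search spends its mutation budget on the parameters that matter most.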
Keywords/Search Tags: Hadoop, job execution time prediction, parameter tuning, job scheduling