Research On Key Techniques Of Automatic Optimization For Big Data Analysis Engine

Posted on:2018-01-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Z D Bei

Full Text:PDF

GTID:1368330533955885

Subject:Computer application technology

Abstract/Summary:

Big data is a current hot topic, and analysis engines such as Spark and Hadoop are the key to exploring the value of data. To meet the differing needs of industry and academia, the Big Data analysis Engine (BDE) has developed into a diversified ecosystem. These systems share a common characteristic: a set of performance-related configuration parameters that users must configure. Parameter configuration for BDE poses four major challenges: (1) the search space for the optimal configuration is huge; (2) big data applications are diverse, and different programs require different configurations; (3) the impact of data size on the optimal configuration differs across big data engines; (4) changes in cluster hardware require the system configuration to be adjusted. As a result, default or manual configurations can make the system run too slowly or even incorrectly. The study of auto-configuration for BDE is therefore necessary for both industry and academia.

Currently, there are three automatic optimization methods for configuration parameters: the rule-based method, the model-based method and the adaptive search method. The rule-based and adaptive search methods do not take into account the complex interdependencies between parameters, and they optimize only a small number of parameters, which limits the room for performance optimization. The model-based approach mainly includes the Analytical Model (AM), the Statistical Reasoning model (SR) and the Machine Learning model (ML). The optimization capabilities of AM and SR are limited because they rest on over-simple linear assumptions, which makes the models inaccurate and leaves the result far from the optimal configuration. The ML-based approach was proposed to address these problems. In contrast to the other approaches, ML is not based on any linear assumption and can automatically learn the relationship between parameter configurations and performance to construct an accurate performance model, which makes it more suitable for the configuration optimization of BDE. However, applying ML to the configuration optimization of BDE raises several problems: the model is not accurate enough, the input data size is not considered, and big data causes excessive time overhead. To address these problems, this dissertation studies three aspects: automatic tuning for on-disk BDE, automatic tuning for in-memory BDE, and online automatic optimization for BDE.

First, we propose RFHOC, an automatic tuning method for the on-disk BDE based on an ensemble learning model. The AM and the support vector machine model are not accurate enough for self-tuning the on-disk BDE. To address this problem, RFHOC builds an accurate performance model based on the random forest algorithm and automatically optimizes a Hadoop program on a given cluster. Specifically, RFHOC first builds ensemble performance models based on the random forest algorithm for the map and reduce phases of the MapReduce workflow. RFHOC then uses a genetic algorithm, guided by the performance model, to automatically search the Hadoop configuration parameter space for the optimal configuration. The experimental results show that the accuracy of RFHOC's performance model is significantly higher than that of the AM: the average error is 4.8% for the map stage and 8.7% for the reduce stage. An evaluation of RFHOC using five typical Hadoop programs, each with five different input data sets, shows a speedup of 2.11× on average and up to 7.4× over the AM-based approach. In addition, RFHOC's performance benefit increases with input data set size.

Second, we propose DAC, a data-aware automatic tuning method for the in-memory BDE based on ensemble learning. Our insight is that the optimal configuration of the in-memory BDE is more sensitive to the input data size than that of the on-disk BDE, so simply extending the auto-tuning approach for on-disk BDE to the in-memory BDE fails to achieve optimal performance. To address this problem, the Data-Aware auto-Configuring (DAC) technique takes the impact of input data size into account when finding the optimal configuration for an in-memory BDE. We evaluate DAC with six Spark programs, each with five input data sets. The results show that DAC speeds up these 30 program-input pairs by a factor of 30.4× on average and up to 89× relative to the default configurations.

Finally, we propose OSC, an online incremental modeling and automatic configuration approach for the online configuration of BDE. Because the input data size of an application may change at every run, even an offline auto-configuration approach is hard to use in practice. To address this issue, we propose an On-line Self-Configuring approach, dubbed OSC, that automatically determines the optimal configuration parameter values for a given application. OSC combines three key techniques. First, it leverages ensemble learning to build a precise performance model for the application. Second, it quantifies the importance of the parameters and the intensity of the interactions between them in terms of performance, in order to accelerate the genetic algorithm's search for the optimal configuration parameters. Third, it introduces an incremental modeling approach to achieve a better trade-off between the accuracy and the overhead of the models. As such, OSC can learn the characteristics of an application and optimize its performance accordingly by automatically adjusting the configuration in an online manner. We have implemented OSC atop Hadoop 2.6, and it can be used in practice. Our experimental results show a speedup over MROnline of 1.6× on average and up to 2.2×. In addition, OSC obtains greater performance benefits as the input data size of an application increases.
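
The search step described in the abstract (a genetic algorithm exploring the configuration space under the guidance of a performance model) can be illustrated with a minimal sketch. This is not the dissertation's implementation. The parameter names and ranges in `PARAM_SPACE` are hypothetical, and `predicted_runtime` is a synthetic, made-up response surface standing in for a trained model such as a random forest; it takes the input data size as an argument only to illustrate the data-aware idea.

```python
import random

# Hypothetical parameter space: names and ranges are illustrative only,
# not real Hadoop or Spark settings taken from the dissertation.
PARAM_SPACE = {
    "sort_buffer_mb": (50, 500),
    "reduce_tasks": (1, 64),
    "parallel_copies": (1, 50),
}


def predicted_runtime(config, data_size_gb):
    """Synthetic stand-in for a trained performance model (assumption)."""
    # Made-up surface: the best buffer size and reducer count grow with data size.
    best_buf = min(500, 50 + 4 * data_size_gb)
    best_red = min(64, 1 + data_size_gb // 2)
    return (10.0 * data_size_gb
            + 0.5 * abs(config["sort_buffer_mb"] - best_buf)
            + 3.0 * abs(config["reduce_tasks"] - best_red)
            + 1.0 * abs(config["parallel_copies"] - 20))


def random_config(rng):
    """Draw one configuration uniformly from the parameter space."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in PARAM_SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from one of two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in PARAM_SPACE}


def mutate(config, rng, rate=0.2):
    """Resample each parameter with probability `rate`."""
    child = dict(config)
    for k, (lo, hi) in PARAM_SPACE.items():
        if rng.random() < rate:
            child[k] = rng.randint(lo, hi)
    return child


def genetic_search(data_size_gb, generations=40, pop_size=30, elite=5, seed=0):
    """Search for the configuration with the lowest predicted runtime."""
    rng = random.Random(seed)

    def fitness(config):
        return predicted_runtime(config, data_size_gb)

    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[:elite]  # elitism: the best configs always survive
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return min(population, key=fitness)


if __name__ == "__main__":
    for size in (10, 100):
        best = genetic_search(size)
        print(size, best, predicted_runtime(best, size))
```

Because the elite configurations are carried over unchanged, the best predicted runtime never gets worse from one generation to the next. In a real tuner the stand-in function would be replaced by a model trained on measured runs, and the model's prediction error would bound how far the result can be trusted.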

Keywords/Search Tags:

Big Data Analysis Engine, Ensemble Learning, Random Forest, Genetic Algorithm, Automatic Configuration

Related items

1	Application Of Learning-to-rank Method Based On Random Forest In Self-made Dataset
2	Symbiotic Forest: A Lightweight Decision Tree Ensemble Method
3	Visual Interpretation And Analysis Of Random Forest
4	The Application Of Random Forest Algorithm In Body Posture Recognition Research
5	Research On Positive And Unlabeled Learning By Random Forest
6	Research On Anomaly Detection Based On Ensemble Learning Algorithms
7	Research On Random Forest Classification Algorithm Based On Spark Distributed Platform
8	Research Of Air Quality Prediction Model Based On Ensemble Learning
9	Research On Ensemble Learning Algorithm Based On Sparse Representation Residual Reconstruction
10	Research On The Random Forest Based Detection Of Malicious Mobile Applications At Runtime