New applications such as big data can extract enormous amounts of valuable information, which is significantly important for economic and social development. However, because such applications are highly varied, it is difficult to develop a single efficient software system that supports them all. The common approach is to develop a highly configurable data processing framework. These frameworks provide many configuration parameters through which users can tailor the framework to the requirements of their applications, so that every application can run on the framework efficiently. To configure a framework properly, however, users need a deep understanding of framework details, such as its internal structure and the execution process of the application, and these details are challenging for ordinary users.

Automatic configuration methods let users tailor a framework properly without understanding such details. These methods employ heuristic search strategies to explore the framework's configuration space and find the best configuration for an application. During the heuristic search, it is important to evaluate each candidate configuration accurately and feed the result back to the search. To evaluate candidate configurations efficiently and accurately, a performance model of the framework must be built, one that predicts the framework's performance for any configuration with high accuracy. Because the relationship between framework configuration and performance is highly complicated, only complex prediction models can represent it. To achieve high prediction accuracy, however, a complex performance model must be trained on a large number of examples, which incurs heavy data collection overhead, and the large-scale training set also increases the cost of model training. Automatic configuration is therefore costly, owing to the data collection and model training demanded by the complex prediction model. This thesis centers on optimizing and accelerating automatic configuration methods for big data processing frameworks.

To reduce the data collection overhead, a novel performance modelling method based on active sampling is introduced. Whereas previous methods treat data collection and model training as independent steps, the proposed method combines them and performs them iteratively: in each iteration it collects training examples according to the requirements of the current model update, then updates the performance model with the collected data and starts the next iteration. Compared with previous performance modelling methods, the proposed method can therefore collect training examples according to the dynamic requirements of the performance model during training, making full use of the collected data and reducing redundancy among the collected examples. As a result, the performance model reaches high accuracy with fewer training examples, and the data collection overhead is reduced relative to previous methods. Experiments show that the proposed method reduces the data collection overhead by about 15% and increases the prediction accuracy of the performance model by about 1%.
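As an illustration, the following is a minimal sketch of such a collect-then-update loop, not the exact algorithm of the thesis. The random-forest surrogate, the use of per-tree disagreement as the sampling signal, and the run_benchmark function (a stand-in for actually executing the framework under a candidate configuration) are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def run_benchmark(config):
    # Hypothetical stand-in for running the framework once under `config`
    # and measuring its performance; returns a synthetic runtime here so
    # that the sketch is executable.
    return float(config @ np.linspace(1.0, 2.0, config.size) + rng.normal(0.0, 0.1))

# Candidate pool: 500 random configurations of 8 normalized parameters.
pool = rng.random((500, 8))

# Seed the labelled set with a handful of measured configurations.
idx = rng.choice(len(pool), size=10, replace=False)
X, y = pool[idx], np.array([run_benchmark(c) for c in pool[idx]])
pool = np.delete(pool, idx, axis=0)

budget = 50  # total number of measurements we can afford
model = RandomForestRegressor(n_estimators=30, random_state=0)

while len(y) < budget:
    model.fit(X, y)
    # Data collection is driven by the current model: disagreement among
    # the forest's trees estimates predictive uncertainty, and the next
    # measurement is spent on the configuration the model is least sure of.
    per_tree = np.stack([tree.predict(pool) for tree in model.estimators_])
    pick = int(per_tree.std(axis=0).argmax())
    X = np.vstack([X, pool[pick]])
    y = np.append(y, run_benchmark(pool[pick]))
    pool = np.delete(pool, pick, axis=0)

model.fit(X, y)  # final performance model, trained on the actively chosen set
```

Because each new example is chosen where the current model is most uncertain, the loop avoids measuring configurations that would merely duplicate information already in the training set, which is the source of the reduced data collection overhead.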
To reduce the model training overhead, the training process is accelerated along two dimensions: the model parameters and the training examples.

First, the convergence difference between model parameters is exploited to eliminate redundant model updates and improve training performance. During the iterative update of the model, different parameters converge at different rates: some converge after only a few iterations, whereas others converge only at the end of training. Previous methods treat all model parameters as a single unit and ignore this convergence difference, which can cause redundant updates to parameters that have already converged. Therefore, a novel method is proposed that accelerates model training by exploiting the convergence difference between parameters. Specifically, the model parameters are divided into several blocks according to their convergence rates, so that the parameters within each block converge at similar rates. During training, the blocks are updated iteratively and independently, and each block is treated as a single unit when testing for convergence. This division reduces redundant parameter updates and improves training efficiency. Compared with traditional training methods, the independent update of parameter blocks may introduce additional noise into the model, because the blocks' update progress is inconsistent, but it can be proved theoretically that this noise affects neither the convergence nor the correctness of training. Experiments show that, compared with a traditional parameter-level parallel training method, the proposed method improves training efficiency by a factor of three and increases the prediction accuracy of the performance model by about 2%; moreover, it achieves nearly ideal speedup on parallel platforms. A minimal sketch of this block-wise scheme is given below.

Second, on the side of training examples, the diversity of the examples is exploited to improve training performance. In each iterative step, previous methods compute the gradient over all training examples to update the model, which incurs heavy computation on a large-scale training set. After only a few iterations, however, most examples contribute little to the gradient and thus have little impact on the model update, yet previous methods keep computing gradients for these unimportant examples, which lowers the efficiency of model updating and wastes computing resources. Hence, a novel method is presented that accelerates model training based on sample diversity: the gradients of unimportant examples are computed only once every few iterations rather than in every step, and between computations their cached gradients are reused. This gradient reuse strategy reduces the computation performed during the iterative model update and improves training performance. Compared with the traditional update, gradient reuse may introduce noise into the model update, but it can be proved theoretically that this gradient noise affects neither the convergence nor the correctness of training. A second sketch below illustrates the gradient reuse strategy.
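The following sketch illustrates the block-wise update scheme under stated assumptions; it is not the thesis's implementation. The model is a least-squares regressor trained by full-batch gradient descent, the blocks are formed by sorting parameters on the magnitude of an early gradient (a crude proxy for the measured convergence rate), and a block is frozen once its update, taken as a single unit, falls below a tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 12
X = rng.random((n, d))
y = X @ rng.normal(size=d) + rng.normal(0.0, 0.01, size=n)

w = np.zeros(d)
lr, tol = 0.1, 1e-6

# Warm-up: use the size of the initial gradient as a proxy for how fast
# each parameter moves, then group parameters into blocks of similar speed.
g = X.T @ (X @ w - y) / n
order = np.argsort(-np.abs(g))        # fastest-moving parameters first
blocks = np.array_split(order, 3)     # three blocks of similar convergence rate
active = [True] * len(blocks)

for step in range(5000):
    g = X.T @ (X @ w - y) / n
    for b, block in enumerate(blocks):
        if not active[b]:
            continue                  # converged blocks receive no further updates
        delta = lr * g[block]
        w[block] -= delta
        # Each block is treated as a single unit for the convergence test;
        # freezing it early is what removes the redundant updates.
        if np.max(np.abs(delta)) < tol:
            active[b] = False
    if not any(active):
        break
```

Freezing a block while the other blocks keep moving is exactly the additional noise discussed above; the sketch simply relies on the theoretical result that this inconsistency does not harm convergence.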
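A second minimal sketch, for the gradient reuse strategy, again under illustrative assumptions: an example's importance is approximated by the size of its current residual, the unimportant examples' gradients are refreshed only on a periodic full pass (here every 10 steps), and their cached gradients are reused in between.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5000, 10
X = rng.random((n, d))
y = X @ rng.normal(size=d) + rng.normal(0.0, 0.01, size=n)

w = np.zeros(d)
lr, refresh_every = 0.1, 10

for step in range(300):
    if step % refresh_every == 0:
        # Periodic full pass: recompute every example's gradient and re-rank
        # the examples by how much they currently contribute.
        residual = X @ w - y
        grad_cache = X * residual[:, None]
        important = np.abs(residual) >= np.quantile(np.abs(residual), 0.7)
    else:
        # In-between steps: refresh only the important examples' gradients;
        # the unimportant 70% keep their cached (slightly stale) gradients.
        r = X[important] @ w - y[important]
        grad_cache[important] = X[important] * r[:, None]
    # The update mixes fresh and reused gradients; the staleness is the
    # bounded gradient noise whose harmlessness the thesis proves.
    w -= lr * grad_cache.mean(axis=0)
```

In most steps only the important fraction of per-example gradients is recomputed, which is where the computational savings come from.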
Experiments show that, compared with the traditional training strategy, the gradient reuse method reduces the computation overhead by about 28% to 54% while having only a limited impact on prediction accuracy.

In summary, for the automatic configuration of big data frameworks, this thesis first proposes a data-efficient model construction method based on active data selection; it builds the performance model with lower data collection overhead and higher prediction accuracy, so that the model can provide more accurate feedback during the heuristic search for automatic configuration. Furthermore, to improve the training efficiency of the high-dimensional and complex performance model, traditional training strategies are improved from the perspectives of both model parameters and training examples; the proposed methods accelerate model training while having only a limited impact on the model's prediction accuracy.