| With the increasing demand for extracting useful information from massive data, big data processing systems such as Spark, Hadoop, and Storm have received more and more attention. Developers provide many custom configuration options for these systems to ensure that periodic jobs run efficiently. While this diversity of configuration offers high flexibility, it also imposes a heavy burden, because a huge configuration space consisting of hundreds of options needs to be explored. Choosing poorly not only significantly degrades performance (e.g., execution time), but may also cause other non-functional properties (e.g., financial cost or reliability) to deteriorate. As a consequence, users stick with poor default configurations to avoid abnormal system behavior caused by misconfigurations. However, automatically searching for the optimal configuration of a big data processing system is challenging, because it involves problems such as complex configuration relationships and multiple optimization objectives. To address these challenges, this thesis carries out a series of studies on configuration optimization for big data processing systems. The main contents cover the following three aspects: 1. Configuration sampling guided by configuration distance and option frequency. As a starting point for configuration optimization, an initial set of configurations with sufficient information is essential. This thesis proposes a new configuration sampling method for big data processing systems based on configuration distance and option frequency. First, a configuration representation with a two-dimensional array structure is designed to contain validity information and value information, and the constraints between configuration options are represented as feature models. Next, configuration distance-based sampling is applied to generate a set of valid configurations uniformly distributed in the configuration space. Then, the value range of each option is divided according to
the frequency of the configuration options, and a value is randomly selected in each interval. The combined use of the two strategies generates a small, valid, and representative set of configurations to support subsequent configuration analysis and optimization. 2. LightGBM-based configuration relational model. The key to finding the optimal configuration is to understand the relationship between configurations and various non-functional properties, which raises the basic problem of configuration prediction. To solve this problem in a lightweight way, this thesis proposes a black-box method for building configuration relational models. Following the operation logic of the big data processing system, the method represents the system as a series of sequentially executed operation stages. Then, the LightGBM algorithm is used to build the configuration relational model of each stage. The model treats the configuration space as a whole and makes no assumptions about the relationships between configurations. Because real measurements of configurations must be collected as the sample set during modeling, a dynamic training strategy based on residual rules is applied to reduce unnecessary measurement overhead. The configuration relational model takes any configuration as input and outputs the corresponding non-functional property prediction. 3. Multi-objective configuration optimization. This thesis formally defines the configuration optimization of big data processing systems as a multi-objective problem and proposes a new optimal configuration search algorithm. The algorithm takes the decomposition-based multi-objective evolutionary algorithm (MOEA/D) as its main framework and combines various techniques to enhance the search process. First, the configuration set generated by Research Content 1 is used as the initial population to give the algorithm a good starting point. Second, the LightGBM model proposed in Research Content 2 is
applied to evaluate the candidate configurations that emerge during evolution, avoiding the excessive overhead of measuring them. Third, a genetic operator tailored to configuration individuals is designed. Finally, to handle possible invalid configurations, a method based on SAT solvers is proposed, which uses two different SAT solvers to repair or replace invalid configurations. In this way, the diversity among configurations is enhanced while the search efficiency of the algorithm is improved. The proposed algorithm aims to provide the end user with a diverse, high-quality set of valid configurations. |
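
To illustrate what a multi-objective result set means here: when several objectives are minimized at once, the search returns configurations that no other candidate beats on every objective. The sketch below shows only this non-dominance filter. It is not the MOEA/D-based algorithm of the thesis. The configuration names, the two objectives (execution time and cost), and the predicted values are hypothetical and stand in for the outputs of a surrogate model.

```python
from typing import Dict, List, Tuple

# One tuple of objective values per configuration; every objective is minimized.
Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """Return True if a is no worse than b on all objectives and better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: Dict[str, Objectives]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, obj in candidates.items():
        dominated = any(
            dominates(other, obj)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical predictions: (execution time in seconds, cost in dollars).
predicted = {
    "conf_a": (120.0, 4.0),
    "conf_b": (90.0, 6.5),
    "conf_c": (130.0, 5.0),  # worse than conf_a on both objectives
    "conf_d": (90.0, 7.0),   # same time as conf_b but more expensive
}

front = pareto_front(predicted)
print(front)
```

Here `conf_a` and `conf_b` remain because each is better than the other on one objective, which is the kind of trade-off set an end user would choose from.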