
Research on XGBoost Performance Optimization Based on Imbalanced Data

Posted on: 2020-01-05    Degree: Master    Type: Thesis
Country: China    Candidate: Q S Yue    Full Text: PDF
GTID: 2428330578455877    Subject: Computer technology
Abstract/Summary:
Since its release in 2016, the XGBoost algorithm has become a popular machine learning algorithm because of its many advantages. However, XGBoost has many hyperparameters, which creates a complex parameter optimization problem. At present, XGBoost parameters are mostly tuned by empirical methods or exhaustive methods, both of which have shortcomings. The empirical method relies heavily on human experience and often costs considerable time and effort. Grid search tries every parameter combination one by one, which is too time-consuming; in practice a coarser grid (a larger step size) is often used to save time, but this easily misses the optimal value. Random search selects sample points at random within the search range, so its results are contingent and uncertain.

On imbalanced data, traditional classification techniques take the overall classification accuracy as their objective, so the decision surface is biased toward the majority class and the classification results fall short of expectations. Although XGBoost can improve the classification accuracy of the minority class by setting the scale_pos_weight parameter to increase the minority-class weight during training, the number of minority-class samples in imbalanced data is usually small, so this method is not very effective. Therefore, based on the relevant theory, this thesis studies two problems: XGBoost parameter optimization, and XGBoost performance optimization on imbalanced data.

(1) Parameter optimization is essentially a black-box problem: the internal structure and properties of the objective function cannot be known. This thesis proposes an XGBoost parameter optimization strategy based on Bayesian optimization. The strategy treats the XGBoost parameters as the input of the objective function and the mean of the evaluation metric obtained by cross-validation as its output. The posterior distribution of the objective function is updated by continuously adding new sample points, and the optimal parameters are obtained when the posterior distribution approximates the real objective function. Because the strategy exploits past parameter evaluations and places no limit on the search distance, it obtains better XGBoost parameters in a shorter time (see the first sketch below). Experimental results show that the strategy consumes significantly less time than grid search and finds better parameters than random search, achieving a better balance between time consumption and optimization quality.

(2) For XGBoost's classification problem on imbalanced data, a new mixed-sampling XGBoost ensemble algorithm is proposed. The ensemble uses EasyEnsemble to randomly draw multiple subsets from the majority-class samples and merges each subset with the minority-class samples, then uses Borderline-SMOTE2 to synthesize additional minority-class samples; these operations yield class-balanced training subsets. An XGBoost base classifier is trained on each training subset, which makes the base classifiers both diverse and individually strong, and all XGBoost base classifiers are integrated into a final classifier to improve classification performance on imbalanced data (see the second sketch below). Experimental results show that the ensemble algorithm outperforms plain XGBoost.
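To make the strategy in (1) concrete, here is a minimal sketch of Bayesian optimization over XGBoost parameters. It assumes the third-party bayes_opt package alongside scikit-learn and xgboost; the tuned parameters, search bounds, AUC objective, and synthetic data are illustrative assumptions, not the thesis's actual experimental setup.

```python
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Illustrative data; the thesis's real datasets are not reproduced here.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def xgb_cv(max_depth, learning_rate, subsample, colsample_bytree):
    """Objective function: XGBoost parameters in, mean cross-validated AUC out."""
    model = XGBClassifier(
        max_depth=int(round(max_depth)),  # continuous proposal -> integer parameter
        learning_rate=learning_rate,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        n_estimators=100,
        eval_metric="logloss",
    )
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

optimizer = BayesianOptimization(
    f=xgb_cv,
    pbounds={  # assumed search ranges for illustration
        "max_depth": (3, 10),
        "learning_rate": (0.01, 0.3),
        "subsample": (0.5, 1.0),
        "colsample_bytree": (0.5, 1.0),
    },
    random_state=0,
)
# Each iteration adds the new (parameters, score) sample to the surrogate's
# posterior and proposes the next point to evaluate.
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best score and parameter combination found
```

Unlike grid search, each new trial is placed where the posterior predicts the most promise rather than on a fixed lattice, which is why far fewer evaluations are needed.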
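And here is a minimal sketch of the mixed-sampling ensemble in (2), assuming imbalanced-learn's BorderlineSMOTE with kind="borderline-2" for the oversampling step; the number of subsets, the 2:1 undersampling ratio, and the probability-averaging combination rule are illustrative assumptions.

```python
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE
from xgboost import XGBClassifier

def fit_mixed_sampling_ensemble(X, y, n_subsets=5, seed=0):
    """EasyEnsemble-style undersampling plus Borderline-SMOTE2 oversampling,
    with one XGBoost base classifier per balanced training subset."""
    rng = np.random.RandomState(seed)
    X_min, y_min = X[y == 1], y[y == 1]  # minority class assumed to be label 1
    X_maj, y_maj = X[y == 0], y[y == 0]
    models = []
    for i in range(n_subsets):
        # EasyEnsemble step: random majority subset (2x minority size, assumed ratio).
        idx = rng.choice(len(X_maj), size=2 * len(X_min), replace=False)
        X_sub = np.vstack([X_maj[idx], X_min])
        y_sub = np.hstack([y_maj[idx], y_min])
        # Borderline-SMOTE2 step: synthesize minority samples near the class
        # border until the subset is balanced.
        X_bal, y_bal = BorderlineSMOTE(
            kind="borderline-2", random_state=seed + i
        ).fit_resample(X_sub, y_sub)
        models.append(XGBClassifier(n_estimators=100).fit(X_bal, y_bal))
    return models

def predict_ensemble(models, X):
    # Integrate the base classifiers by averaging predicted probabilities.
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)
```

Each subset sees a different slice of the majority class, which gives the base classifiers diversity, while the Borderline-SMOTE2 samples strengthen the minority-class boundary; averaging the probabilities plays the role of the integration step described above.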
Keywords/Search Tags: XGBoost, Imbalanced Data, Mixed Sampling, Parameter Optimization, Bayesian Optimization