Gradient boosting decision tree (GBDT) is an ensemble learning method that uses decision trees as weak learners and can be applied to classification, regression, and ranking. Thanks to its ease of training and high interpretability, GBDT is widely used in data analysis tasks such as spam detection, advertising, sales forecasting, and medical data analysis. However, on large-scale data sets the training and inference of GBDT are extremely expensive in both time and space, and many challenges remain in optimizing the model's performance. This thesis therefore focuses on performance optimization of GBDT with respect to training time, memory consumption, and incremental learning. The main work of this thesis can be summarized as follows:

(1) To tackle the long training time and large memory consumption incurred on large-scale data sets, we study multi-GPU performance optimization of GBDT. We optimize the training-time computation of the first- and second-order derivatives, the gain computation, and the tree update after data splitting, and the inference stage of GBDT is optimized as well. Based on these optimizations, we implement a fast GBDT system that runs on multiple GPUs.

(2) To address GBDT's shortcomings of irregular memory access and lack of support for incremental learning, we propose a gradient boosting decision tree framework that supports incremental learning. The framework allows users to integrate domain knowledge into the model: users can pre-design or select a tree structure suited to their own machine learning tasks. In addition, the model can be trained both in parallel and incrementally.

(3) A series of experiments was designed and carried out to analyze the performance of the proposed models and verify their effectiveness. The multi-GPU GBDT optimization system is implemented in CUDA-C. Experiments on 8 real data sets show that, compared with three popular GBDT systems (XGBoost, LightGBM, and CatBoost), our multi-GPU GBDT system is 6-10 times faster than the GPU version of XGBoost, 2.4-7.4 times faster than LightGBM, and 10.3 times faster than CatBoost. For the proposed incremental GBDT model, comparative experiments on 7 real-world data sets from different domains show that its prediction error is competitive with that of XGBoost while its training time is much lower, which verifies the feasibility of the model.

(4) As a case study, the proposed incremental GBDT model is applied to aspect-level sentiment analysis tasks to demonstrate its usability for data analysis. For these tasks, a two-step framework that can expand model capacity is further developed. Experimental results on the two SemEval 2014 datasets show that the proposed model outperforms SVM-based methods with hand-crafted features in prediction accuracy. In addition, compared with neural-network-based methods, the proposed model achieves new state-of-the-art results on the laptop data set and competitive performance on the restaurant data set.
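To make the quantities optimized in (1) concrete, the sketch below shows the per-instance first- and second-order derivative computation and the regularized split-gain formula that GBDT systems such as XGBoost evaluate at every candidate split. This is a minimal illustrative sketch in NumPy, not the thesis's CUDA-C implementation; the logistic loss, the regularization parameter `lam`, and all function names are assumptions for illustration.

```python
import numpy as np

def grad_hess_logistic(y_true, y_pred_raw):
    """First- and second-order derivatives of the logistic loss
    with respect to the raw (pre-sigmoid) predictions."""
    p = 1.0 / (1.0 + np.exp(-y_pred_raw))
    grad = p - y_true        # dL/df
    hess = p * (1.0 - p)     # d2L/df2
    return grad, hess

def split_gain(g, h, left_mask, lam=1.0):
    """Regularized gain of splitting a node into left/right children,
    following the standard second-order (XGBoost-style) formulation."""
    def score(gs, hs):
        # Structure score of a leaf holding these instances.
        return gs.sum() ** 2 / (hs.sum() + lam)
    return 0.5 * (score(g[left_mask], h[left_mask])
                  + score(g[~left_mask], h[~left_mask])
                  - score(g, h))

# Toy example: two instances, raw scores start at 0 (so p = 0.5 each).
y = np.array([0.0, 1.0])
f = np.zeros(2)
g, h = grad_hess_logistic(y, f)               # g = [0.5, -0.5], h = [0.25, 0.25]
gain = split_gain(g, h, np.array([True, False]))  # ≈ 0.2 with lam = 1.0
```

Because both `grad_hess_logistic` and `score` reduce to independent per-instance (or per-partition) arithmetic, they map naturally onto the data-parallel GPU kernels that the multi-GPU optimization in (1) targets.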