
Research on Imbalanced Data and Its Application

Posted on: 2020-07-27    Degree: Master    Type: Thesis
Country: China    Candidate: X H Hao    Full Text: PDF
GTID: 2417330590982848    Subject: Applied Statistics
Abstract/Summary:
With the development of information technology, data from all walks of life are growing explosively. In this situation, how to quickly and effectively extract valuable information and knowledge from the ocean of data has become an important problem for every industry, and imbalanced data, because it is so common in real life, has become one of the research hotspots of experts and scholars.

This paper takes the default of credit card clients data set on UCI as an example. In this data set, the sample size of normal customers (class 0) is 23,364 and the sample size of default customers (class 1) is 6,636, a class ratio of about 3.5:1. If a random forest model is built directly on the original data, the AUC value is 0.7195 and the recall rate of default customers is only 0.34. Therefore, the data are processed with imbalanced-data methods in order to improve the comprehensive evaluation index (the AUC value) and the recall rate of default customers. The research content is as follows:

(1) Data preprocessing, including missing value and outlier checks, feature derivation, standardization, discretization of continuous variables, and feature selection based on the sample distribution of each feature across the two classes and on random forest feature importance ranking (see the first sketch after this abstract).

(2) Selection of the optimal method at the data level. The sampling methods include undersampling, oversampling and mixed sampling: undersampling is divided into basic undersampling and cluster-based undersampling (drawing on the CUSBoost algorithm), and the mixed sampling methods are SMOTEENN and SMOTE+Tomek links. The five methods above are compared with a random forest model (see the second sketch below). The experimental results show that SMOTEENN works best, with an AUC value of 0.7458 and a recall rate of 0.60.

(3) Selection of the optimal method at the algorithm level. LR, SVM, RF, XGBoost and LightGBM models are built on the SMOTEENN-resampled data, and the parameters of each model are tuned by experience and by grid search (see the third sketch below). The experimental results show that the optimal model is LightGBM combined with SMOTEENN, with an AUC value of 0.7815 and a recall rate of 0.70. Compared with the initial results, the AUC value increased by 0.062 and the recall rate by 0.36.
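The first sketch illustrates the feature-selection part of step (1): comparing per-feature class distributions and ranking features by random forest importance. The file name credit_default.csv and the label column "default" are assumptions for illustration, not the thesis's original script.

```python
# Minimal sketch: rank features by class-wise distribution and random forest importance.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed CSV export of the UCI "default of credit card clients" data set,
# with the label column renamed to "default" (1 = default customer, 0 = normal).
data = pd.read_csv("credit_default.csv")
X, y = data.drop(columns="default"), data["default"]

# Per-feature mean by class: large gaps between the two classes hint at informative features.
class_means = data.groupby("default").mean().T
print(class_means.head())

# Impurity-based feature importance from a random forest.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(10))
```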
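The second sketch corresponds to step (2): resample the training set with several strategies from imbalanced-learn and score a random forest on an untouched test set with AUC and class-1 recall. The cluster-based (CUSBoost-style) undersampling variant has no stock imbalanced-learn implementation and is omitted here; the data file and label column are the same assumptions as above.

```python
# Minimal sketch: compare data-level resampling methods with a random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek

data = pd.read_csv("credit_default.csv")           # assumed file name
X, y = data.drop(columns="default"), data["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

samplers = {
    "undersampling": RandomUnderSampler(random_state=42),
    "oversampling (SMOTE)": SMOTE(random_state=42),
    "SMOTEENN": SMOTEENN(random_state=42),
    "SMOTE+Tomek links": SMOTETomek(random_state=42),
}

for name, sampler in samplers.items():
    # Resample only the training data; the test set keeps the original imbalance.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_res, y_res)
    proba = clf.predict_proba(X_test)[:, 1]
    pred = clf.predict(X_test)
    print(f"{name}: AUC={roc_auc_score(y_test, proba):.4f}, "
          f"recall(class 1)={recall_score(y_test, pred):.2f}")
```

Resampling only the training split keeps the evaluation honest: the test set retains the real 3.5:1 imbalance that the model will face in practice.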
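The third sketch corresponds to step (3): tune a LightGBM classifier with a grid search on the SMOTEENN-resampled training data and report AUC and the recall of default customers on the held-out test set. The parameter grid is illustrative only, not the thesis's exact settings.

```python
# Minimal sketch: grid-searched LightGBM on SMOTEENN-resampled training data.
import pandas as pd
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.combine import SMOTEENN
from lightgbm import LGBMClassifier

data = pd.read_csv("credit_default.csv")           # assumed file name
X, y = data.drop(columns="default"), data["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

# Illustrative grid; the thesis tuned parameters by experience plus grid search.
param_grid = {
    "num_leaves": [31, 63],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 500],
}
search = GridSearchCV(
    LGBMClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X_res, y_res)

best = search.best_estimator_
proba = best.predict_proba(X_test)[:, 1]
pred = best.predict(X_test)
print("best params:", search.best_params_)
print(f"AUC={roc_auc_score(y_test, proba):.4f}, "
      f"recall(class 1)={recall_score(y_test, pred):.2f}")
```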
Keywords/Search Tags: undersampling, oversampling, mixed sampling, XGBoost, LightGBM