
Research on Imbalanced Data and Its Application

Posted on: 2020-07-27    Degree: Master    Type: Thesis
Country: China    Candidate: X H Hao    Full Text: PDF
GTID: 2417330590982848    Subject: Applied Statistics
Abstract/Summary:
With the development of information technology, data from all walks of life are growing explosively. In this situation, how to quickly and effectively extract valuable information and knowledge from the ocean of data has become an important problem for every industry, and imbalanced data, because it is so common in real life, has become one of the research hotspots of experts and scholars.

This paper takes the default of credit card clients data set on UCI as an example. In this data set, the sample size of normal customers (class 0) is 23,364 and the sample size of default customers (class 1) is 6,636, a class ratio of about 3.5:1. If a random forest model is built directly on the original data, the AUC value is 0.7195 and the recall rate of default customers is only 0.34. Therefore, the data are processed with imbalanced-data methods in order to improve the comprehensive evaluation index (the AUC value) and the recall rate of default customers. The research content is as follows:

(1) Data preprocessing, including missing value and outlier checks, feature derivation, standardization, discretization of continuous variables, and feature selection based on the sample distribution of each feature across the two classes and on random forest feature importance ranking (see the first sketch after this abstract).

(2) Selection of the optimal method at the data level. The sampling methods include undersampling, oversampling and mixed sampling: undersampling is divided into basic undersampling and cluster-based undersampling (drawing on the CUSBoost algorithm), and the mixed sampling methods are SMOTEENN and SMOTE+Tomek links. The five methods above are compared with a random forest model (see the second sketch below). The experimental results show that SMOTEENN works best, with an AUC value of 0.7458 and a recall rate of 0.60.

(3) Selection of the optimal method at the algorithm level. LR, SVM, RF, XGBoost and LightGBM models are built on the SMOTEENN-resampled data, and the parameters of each model are tuned by experience and by grid search (see the third sketch below). The experimental results show that the optimal model is LightGBM combined with SMOTEENN, with an AUC value of 0.7815 and a recall rate of 0.70. Compared with the initial results, the AUC value increased by 0.062 and the recall rate by 0.36.
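The first sketch illustrates the feature-selection part of step (1): comparing per-feature class distributions and ranking features by random forest importance. The file name credit_default.csv and the label column "default" are assumptions for illustration, not the thesis's original script.

```python
# Minimal sketch: rank features by class-wise distribution and random forest importance.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed CSV export of the UCI "default of credit card clients" data set,
# with the label column renamed to "default" (1 = default customer, 0 = normal).
data = pd.read_csv("credit_default.csv")
X, y = data.drop(columns="default"), data["default"]

# Per-feature mean by class: large gaps between the two classes hint at informative features.
class_means = data.groupby("default").mean().T
print(class_means.head())

# Impurity-based feature importance from a random forest.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(10))
```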
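The second sketch corresponds to step (2): resample the training set with several strategies from imbalanced-learn and score a random forest on an untouched test set with AUC and class-1 recall. The cluster-based (CUSBoost-style) undersampling variant has no stock imbalanced-learn implementation and is omitted here; the data file and label column are the same assumptions as above.

```python
# Minimal sketch: compare data-level resampling methods with a random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek

data = pd.read_csv("credit_default.csv")           # assumed file name
X, y = data.drop(columns="default"), data["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

samplers = {
    "undersampling": RandomUnderSampler(random_state=42),
    "oversampling (SMOTE)": SMOTE(random_state=42),
    "SMOTEENN": SMOTEENN(random_state=42),
    "SMOTE+Tomek links": SMOTETomek(random_state=42),
}

for name, sampler in samplers.items():
    # Resample only the training data; the test set keeps the original imbalance.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_res, y_res)
    proba = clf.predict_proba(X_test)[:, 1]
    pred = clf.predict(X_test)
    print(f"{name}: AUC={roc_auc_score(y_test, proba):.4f}, "
          f"recall(class 1)={recall_score(y_test, pred):.2f}")
```

Resampling only the training split keeps the evaluation honest: the test set retains the real 3.5:1 imbalance that the model will face in practice.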
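The third sketch corresponds to step (3): tune a LightGBM classifier with a grid search on the SMOTEENN-resampled training data and report AUC and the recall of default customers on the held-out test set. The parameter grid is illustrative only, not the thesis's exact settings.

```python
# Minimal sketch: grid-searched LightGBM on SMOTEENN-resampled training data.
import pandas as pd
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.combine import SMOTEENN
from lightgbm import LGBMClassifier

data = pd.read_csv("credit_default.csv")           # assumed file name
X, y = data.drop(columns="default"), data["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

# Illustrative grid; the thesis tuned parameters by experience plus grid search.
param_grid = {
    "num_leaves": [31, 63],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 500],
}
search = GridSearchCV(
    LGBMClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X_res, y_res)

best = search.best_estimator_
proba = best.predict_proba(X_test)[:, 1]
pred = best.predict(X_test)
print("best params:", search.best_params_)
print(f"AUC={roc_auc_score(y_test, proba):.4f}, "
      f"recall(class 1)={recall_score(y_test, pred):.2f}")
```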
Keywords/Search Tags: undersampling, oversampling, mixed sampling, XGBoost, LightGBM