Application Of Cost-sensitive Learning Based On Re-sampling In Online Loan Users

Posted on:2020-12-25

Degree:Master

Type:Thesis

Country:China

Candidate:B N Guo

Full Text:PDF

GTID:2428330590460477

Subject:Computational Mathematics

Abstract/Summary:

Online loan data sets are typically imbalanced: the volume of applications is large, but few are approved. Using machine learning to pre-screen the users who are likely to be granted loans can greatly reduce the workload of subsequent manual audits and speed up the response to loan applicants, which gives the approach good practical value.

In recent decades, many efforts have been made to improve classification performance on the minority class. Two general approaches are currently available for imbalanced data classification. One works at the data level and is known as data set reconstruction or re-sampling. The other works at the algorithm level and modifies existing data mining algorithms.

In this thesis, an improved random balanced sampling algorithm is proposed at the data level, and an improved cost-sensitive decision tree based on the ID3 algorithm is proposed at the algorithm level. Finally, the data-level and algorithm-level improvements are fused into a new algorithm that aims to minimize the total cost of misclassification. The proposed method is applied to the classification of online loan users. The main work and conclusions of this thesis are as follows:

1. Re-sampling algorithm for imbalanced data sets: Based on the random balanced sampling algorithm, this thesis proposes an improved random balanced sampling algorithm. First, according to their location, all sample points are divided into three categories: safe points, boundary points and noise points. Then, most of the samples among the noise points and boundary points are removed, so that the classification boundaries between the classes become clearer. Different sampling methods are used for different categories of samples: the majority class is under-sampled and, at the same time, the minority class is over-sampled, so that the classes in the resulting sample set are roughly equal in size. Compared with the random balanced sampling algorithm, this algorithm improves the classification accuracy on the minority class in online loan classification.

2. Cost-sensitive learning algorithm for imbalanced data sets: In this thesis, the class distribution is added to the calculation of the sensitivity function of the cost-sensitive decision tree, in order to reduce the impact that a large difference between the numbers of positive and negative samples has on the total cost of misclassification. The result is an improved cost-sensitive decision tree. With this tree as the base classifier, multiple models are trained. Each model predicts the original data set, the original data set is re-labeled using the minimum expected total cost of misclassification as the criterion, and a new model is trained on the re-labeled data set. The new model is then integrated with the base classifiers that have higher classification accuracy to obtain the final classifier. In online loan classification, compared with the cost-sensitive decision tree algorithm, this algorithm improves the overall classification accuracy and generalizes better.

3. Fusion of re-sampling and cost-sensitive learning: Most research on imbalanced data sets uses either pure re-sampling or pure cost-sensitive learning. Because class imbalance and unequal misclassification costs often occur together, this thesis attempts to integrate data set reconstruction with cost-sensitive learning. First, re-sampling is used to reduce the imbalance of the data set, and then the cost-sensitive learning algorithm is used to build the model. Compared with pure cost-sensitive learning algorithms, the classification accuracy of this algorithm is improved.
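
Two of the steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and the abstract does not give the exact rules. The sketch assumes a k-nearest-neighbour rule for splitting samples into safe, boundary and noise points (in the spirit of borderline re-sampling methods), and assumes that re-labeling picks the class with the lowest expected cost under a cost matrix. The function names, the thresholds and the parameter k are hypothetical.

```python
from math import dist


def categorize_points(points, labels, k=3):
    """Tag each sample as 'safe', 'boundary', or 'noise'.

    Assumed rule (not taken from the thesis): inspect the k nearest
    neighbours of a sample. If all of them belong to another class the
    sample is noise; if at least half do, it is a boundary point;
    otherwise it is safe.
    """
    tags = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Indices of the k samples closest to p (Euclidean), excluding p itself.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        other = sum(1 for j in neighbours if labels[j] != y)
        if other == len(neighbours):
            tags.append("noise")
        elif 2 * other >= len(neighbours):
            tags.append("boundary")
        else:
            tags.append("safe")
    return tags


def min_expected_cost_label(class_probs, cost):
    """Return the label with the lowest expected misclassification cost.

    class_probs[j] is the estimated probability that the true class is j
    (for example, averaged over several base classifiers).
    cost[i][j] is the cost of predicting class i when the truth is j.
    """
    expected = [
        sum(p * cost[i][j] for j, p in enumerate(class_probs))
        for i in range(len(cost))
    ]
    return min(range(len(expected)), key=expected.__getitem__)
```

With a cost matrix such as [[0, 5], [1, 0]], where wrongly predicting class 0 for a true class 1 sample costs five times as much as the opposite error, a sample with estimated probabilities [0.7, 0.3] is re-labeled as class 1, even though class 0 is more probable. This shows how cost-based re-labeling shifts decisions toward the class that is expensive to miss.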

Keywords/Search Tags:

imbalanced data sets, re-sampling, online loan, random balanced sampling, cost-sensitive learning

Related items

1	Research On Classification Method For Imbalanced Datasets
2	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
3	Imbalanced Data Classification And Its Application In The Prediction Of The Mobile Phone Replacement
4	Research On Imbalanced Data Classification Algorithms Based On Weight Analysis Of Loss Function
5	Imbalanced Learning And Its Application Based On Manifold Embedded Over-sampling
6	The Improved Random Forests Based On The Imbalanced Data Classification
7	The Research Of Imbalanced Data Classification
8	Hybrid Ensemble Learning For Imbalanced Data
9	Classification Learning Of Imbalanced Data Sets Based On Sampling Processing
10	An Adaptive Sampling Ensemble Classifier For Learning From Imbalanced Data Sets