Application Of Cost-sensitive Learning Based On Re-sampling In Online Loan Users

Posted on: 2020-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: B N Guo
GTID: 2428330590460477
Subject: Computational Mathematics
Abstract/Summary:
Online loan data sets are typically imbalanced: the volume of applications is large, but few are approved. Using machine learning to pre-screen users who are likely to be granted loans can greatly reduce the workload of subsequent manual audits and shorten the response time for loan applicants, which gives the approach good practical value. In recent decades, much effort has gone into improving classification performance on the minority class. Two general approaches are currently available for imbalanced data classification. One works at the data level and is known as data set reconstruction or re-sampling. The other works at the algorithm level and modifies existing data mining algorithms. In this thesis, an improved random balanced sampling algorithm is proposed at the data level, and an improved cost-sensitive decision tree based on the ID3 algorithm is proposed at the algorithm level. Finally, the data-level and algorithm-level improvements are fused into a new algorithm that aims to minimize the total misclassification cost. The proposed method is applied to the classification of online loan users. The main work and conclusions of this thesis are as follows:

1. Re-sampling algorithm for imbalanced data sets: Building on the random balanced sampling algorithm, this thesis proposes an improved random balanced sampling algorithm. First, according to their location, all sample points are divided into three categories: safe points, boundary points and noise points. Then, most of the samples among the noise points and boundary points are removed, so that the classification boundaries between the classes become clearer. Different sampling methods are used for the different categories of samples. At the same time, the majority class is under-sampled and the minority class is over-sampled, so that the classes in the sample set end up with roughly the same number of samples. Compared with the random balanced sampling algorithm, this algorithm improves the classification accuracy on the minority class in online loan classification.

2. Cost-sensitive learning algorithm for imbalanced data sets: In this thesis, the class distribution is incorporated into the sensitivity function of the cost-sensitive decision tree to reduce the impact that a large difference between the numbers of positive and negative samples has on the total misclassification cost, and an improved cost-sensitive decision tree is constructed. Using this tree as the base classifier, multiple models are trained. Each model predicts on the original data set, the original data set is re-labeled according to the criterion of minimum expected total misclassification cost, and a new model is trained on the re-labeled data set. The new model is then integrated with the base classifiers that have higher classification accuracy to obtain the final classifier. In online loan classification, compared with the cost-sensitive decision tree algorithm, this algorithm improves the overall classification accuracy and generalizes better.

3. Combining re-sampling with cost-sensitive learning: Most research on imbalanced data sets uses either pure re-sampling or pure cost-sensitive learning. Because class imbalance and unequal misclassification costs often occur together, this thesis attempts to integrate data set reconstruction with cost-sensitive learning. First, a re-sampling method is used to reduce the imbalance of the data set, and then a cost-sensitive learning algorithm is used to build the model. Compared with pure cost-sensitive learning algorithms, this algorithm achieves higher classification accuracy.
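The two building blocks described in the abstract, dividing samples into safe, boundary and noise points and re-labeling by minimum expected misclassification cost, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not state how the three categories are decided, so the k-nearest-neighbour rule below (the convention used by Borderline-SMOTE) is an assumption, as are the function names `categorize_points` and `min_expected_cost_label`, the default `k`, and the cost-matrix layout.

```python
from math import dist


def categorize_points(X, y, k=3):
    """Label each sample 'safe', 'boundary' or 'noise'.

    Assumed rule (not taken from the thesis): look at the k nearest
    neighbours of a sample. If all of them belong to another class the
    sample is noise; if at least half do, it is a boundary point;
    otherwise it is safe.
    """
    categories = []
    for i, xi in enumerate(X):
        # Indices of the k nearest other samples by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )[:k]
        other = sum(1 for j in neighbours if y[j] != y[i])
        if other == len(neighbours):
            categories.append("noise")
        elif other * 2 >= len(neighbours):
            categories.append("boundary")
        else:
            categories.append("safe")
    return categories


def min_expected_cost_label(probs, cost):
    """Return the class whose prediction has the lowest expected cost.

    probs[j] is the estimated probability that the true class is j.
    cost[i][j] is the (assumed) cost of predicting class i when the
    true class is j.
    """
    expected = [
        sum(p * cost[i][j] for j, p in enumerate(probs))
        for i in range(len(cost))
    ]
    return min(range(len(expected)), key=expected.__getitem__)
```

For example, with a hypothetical cost matrix `[[0, 10], [1, 0]]`, in which missing a minority-class sample costs ten times as much as a false alarm, a sample with class probabilities `[0.8, 0.2]` is re-labeled as class 1 even though class 0 is more probable. This is how cost-sensitive re-labeling shifts decisions toward the minority class.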
Keywords/Search Tags:imbalanced data sets, re-sampling, online loan, random balanced sampling, cost-sensitive learning