A Comparative Study Of Oversampling Techniques Based On Unbalanced Credit Data Sets

Posted on: 2021-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: H M Mao
GTID: 2518306113961879
Subject: Economic big data analysis
Abstract/Summary:
In the age of data inflation, a large number of credit-card-based transactions take place around the world, and fraud in credit-card-based online payment has increased dramatically. This has prompted banks, credit institutions and e-commerce organizations to implement automated fraud detection systems and to mine large volumes of transaction logs in order to distinguish fraudulent users from non-fraudulent users. Machine learning appears to be one of the most effective ways to uncover illegal transactions: a binary classifier is trained on a pre-filtered sample set to distinguish fraud from non-fraud instances. However, fraud detection data sets are highly unbalanced by nature. An imbalanced class distribution strongly affects many classification algorithms, and common classification algorithms cannot learn effectively from unbalanced data. Classifiers trained on unbalanced data sets tend to assign samples to the majority class, whereas in fraud detection we are more interested in the minority class of fraudulent defaults, so the effectiveness of the binary classifier is greatly reduced.

Over the past few years, oversampling techniques have often been used to mitigate data imbalance because they are simple and easy to implement, but most existing oversampling methods cannot offset imbalance within the minority class, which is often a major problem in unevenly distributed data sets. To address this, building on the SMOTE oversampling technique, this paper proposes a K-means SMOTE oversampling technique, which combines K-means clustering with SMOTE oversampling to rebalance the skewed data set and avoids generating noise by oversampling only in safe areas. In addition, it addresses both between-class and within-class imbalance, and handles small disjuncts by inflating sparse minority areas. LR, RF, SVC and XGBoost are selected as classifiers to verify the effectiveness of this sampling technique. The paper compares and analyzes the performance of the selected classifiers when trained on data resampled with K-means SMOTE, SMOTE, Balanced, Weighted, ADASYN, Oversampling, Undersampling and other resampling techniques. The experimental results show that the K-means SMOTE sampling technique can improve classifier performance.

Although K-means SMOTE yields good results, it may increase false positives while reducing false negatives, because it takes into account neither the overall distribution of the data set nor the class labels when oversampling. This paper therefore introduces the WCGAN oversampling technique to address the class imbalance of the data set. WCGAN considers both the label distribution and the real data distribution when generating data, so it can generate more realistic samples. By replacing the JS divergence used in GAN with the EM (Earth-Mover) distance, WCGAN effectively avoids mode collapse and training instability during model training and improves the quality of the generated data. Because the resampled data set is class-balanced, accuracy is used as the final evaluation metric, measured with XGBoost, the classifier with the best classification performance. The experimental results show that the WCGAN oversampling technique outperforms the other oversampling methods.
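
As an illustration of the interpolation step that SMOTE-style oversampling relies on, the following is a minimal sketch in plain Python. It is not the thesis's implementation: the function name `smote`, its parameters (`minority`, `n_new`, `k`, `seed`) and the toy data are all assumptions made for this example. K-means SMOTE, as described in the abstract, would additionally cluster the data first and apply this kind of interpolation only inside clusters considered safe; that clustering step is not shown here.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest minority neighbours considered for each base sample.
    All names here are illustrative, not taken from the thesis.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up): four minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a line segment between two existing minority samples, the new points stay inside the region already occupied by the minority class. This is also why plain SMOTE can amplify noise when a minority sample sits among majority samples, which is the weakness that restricting sampling to safe areas is meant to address.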
Keywords/Search Tags: category imbalance, credit risk forecast, K-means SMOTE, WCGAN