An Undersampling Method Based On KAMILA Clustering And Elimination Of Redundancy

Posted on:2021-12-15

Degree:Master

Type:Thesis

Country:China

Candidate:D Y Shen

Full Text:PDF

GTID:2518306113969429

Subject:Statistics

Abstract/Summary:

To address the class imbalance problem in credit card application scoring, this paper proposes a new undersampling method, UMBKER (an Undersampling Method Based on KAMILA and Elimination of Redundancy). The method applies to mixed datasets that contain both continuous and categorical variables. It removes redundant samples to lower the imbalance ratio without changing the distributional characteristics of most samples, which reduces the impact of imbalanced data on models.

UMBKER combines the KAMILA clustering method with a redundancy elimination algorithm. KAMILA clustering is first applied to the majority class of the dataset, and redundant samples are then removed within each cluster. To remove them, the method computes the distance between every pair of samples in a cluster and the distance between each sample and the center of its cluster, then multiplies the two values to obtain a similarity redundancy coefficient matrix. It selects the minimum value in this matrix, finds the two samples that correspond to it, deletes one of them at random, and updates the matrix. These steps repeat until the stopping condition is reached. Once every cluster has been processed, the remaining samples are combined into the new undersampled dataset. Each execution reduces the imbalance by about half, so the algorithm can be run several times if the original dataset is highly imbalanced.

Several experiments are reported. On randomly generated mixed datasets, KAMILA clustering is compared with other existing clustering algorithms for mixed data and performs best in both clustering accuracy and stability. On the Abalone dataset, logistic regression trained on data undersampled by UMBKER performs better than logistic regression trained on data produced by other methods: the F1 value improves to some extent, while AUC and some other metrics remain unchanged. This experiment indicates that the method removes some uninformative samples while keeping most of the information, lowering the imbalance ratio while maintaining the recognition rate for minority samples at its original level.

The paper also carries out an empirical test on a credit card dataset. It systematically describes the data processing, variable selection, and definition of the dependent variable, among other steps, and applies the UMBKER algorithm to an actual credit card scoring model: the majority samples are undersampled, redundant samples are deleted, and the imbalance ratio of the dataset is reduced. The empirical results show that the logistic regression model trained on the dataset after several rounds of UMBKER undersampling performs better in F1 value, recall, and AUC. Compared with the earlier experimental results, UMBKER undersampling appears better suited to real data with large sample sizes. Finally, the undersampled data is used to train the final credit card scoring model. In summary, the UMBKER undersampling method can improve the performance of credit card scoring.
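
The redundancy elimination step that the abstract describes for a single cluster can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several details are assumptions because the abstract does not specify them: the function name `eliminate_redundancy` and its parameters are hypothetical; plain Euclidean distance on numeric features stands in for whatever mixed-data distance the thesis uses; the two center distances of a pair are combined by their mean before being multiplied by the pairwise distance; the cluster center is kept fixed while the matrix is updated; and the stopping condition is taken to be "half of the cluster remains", based on the statement that one execution roughly halves the imbalance. The KAMILA clustering itself is assumed to have been run beforehand.

```python
import math
import random


def eliminate_redundancy(points, keep_fraction=0.5, seed=0):
    """Return the sorted indices of the samples kept from one cluster.

    points: list of equal-length numeric tuples (one majority-class cluster).
    keep_fraction: ASSUMED stopping rule, the share of the cluster to keep.
    """
    n = len(points)
    if n == 0:
        return []
    rng = random.Random(seed)
    target = max(1, math.ceil(n * keep_fraction))

    # Cluster center and each sample's distance to it (Euclidean: an assumption).
    dim = len(points[0])
    center = [sum(p[k] for p in points) / n for k in range(dim)]
    d_center = [math.dist(p, center) for p in points]

    # Redundancy coefficient for every unordered pair (i, j):
    # pairwise distance times the mean center distance (ASSUMED combination).
    coeff = {}
    for i in range(n):
        for j in range(i + 1, n):
            pair_dist = math.dist(points[i], points[j])
            coeff[(i, j)] = pair_dist * 0.5 * (d_center[i] + d_center[j])

    alive = set(range(n))
    while len(alive) > target:
        # The pair with the smallest coefficient is treated as most redundant.
        i, j = min(coeff, key=coeff.get)
        drop = rng.choice((i, j))  # delete one of the two at random
        alive.discard(drop)
        # "Update the matrix": discard every entry that involves the removed sample.
        coeff = {pair: v for pair, v in coeff.items() if drop not in pair}
    return sorted(alive)


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (5.0, 5.0), (5.1, 5.0), (-4.0, 3.0)]
    print(eliminate_redundancy(cluster))
```

Applying the function to every majority-class cluster and concatenating the kept samples with the untouched minority class would give one undersampling pass; repeating the pass corresponds to running the algorithm several times on a highly imbalanced dataset.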

Keywords/Search Tags:

Imbalanced Mixed Data, Credit Card Scoring, Undersampling, KAMILA Clustering, Remove Redundancy, Logistic Regression

Related items

1	Retail Banking Credit Scoring Model Development And Implementation
2	Research On Ensemble Credit Scoring Model For Imbalanced Data
3	Study On Application Credit Scorecards In The Retail Bank Based On Data Mining Technology
4	Research On Credit Scoring Model Based On Machine Learning
5	Design And Implementation Of Heze Development Zone Agricultural Bank Credit Card Management System
6	Research And Application Of Data Mining In Cross-marketing Credit Card Of Commercial Bank Debit Card
7	The Application Of Data Mining Methods In Credit Card Default Prediction
8	Kernel Logistic Regression For Imbalanced Data Classification
9	Research On Logistic Regression Learning Algorithm For Imbalanced Problem
10	Design And Implementation Of Credit Scoring System