
An Undersampling Method Based On KAMILA Clustering And Elimination Of Redundancy

Posted on: 2021-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Shen
GTID: 2518306113969429
Subject: Statistics
Abstract/Summary:
To address the problem of data imbalance in credit card application scoring, this paper proposes a new undersampling method, UMBKER (an Undersampling Method Based on KAMILA and Elimination of Redundancy). The method applies to mixed datasets that contain both continuous and categorical variables. It removes redundant samples to lower the imbalance ratio without changing the distributional characteristics of most samples, which reduces the impact of imbalanced data on models.

UMBKER combines the KAMILA clustering method with a redundancy-elimination algorithm. KAMILA clustering is first applied to the majority class of the dataset, and redundant samples are then removed within each cluster. To remove redundant samples, the method calculates the distance between every pair of samples in a cluster and the distance between each sample and the center of its cluster, then multiplies the two values to obtain a similarity-redundancy coefficient matrix. It selects the minimum value in this matrix, finds the two samples that correspond to it, deletes one of them at random, and updates the matrix. These steps repeat until the stopping condition is reached. After every cluster has been processed, the remaining samples are combined into the new undersampled dataset. Each execution reduces the imbalance ratio by about half; if the original dataset is highly imbalanced, the algorithm can be run several times.

Several experiments are carried out. On randomly generated mixed datasets, KAMILA clustering is compared with other existing algorithms for clustering mixed data and is the best in both clustering accuracy and stability. On the Abalone dataset, logistic regression trained on data undersampled by UMBKER performs better than logistic regression trained on data produced by other methods. The results show some improvement in F1 value, while AUC and several other metrics remain the same as before. This experiment indicates that the method removes some uninformative samples while keeping most of the information: it lowers the imbalance ratio and keeps the recognition rate of the minority samples at the level of the original dataset.

The paper also makes an empirical test on a credit card dataset. We systematically describe the data processing, variable selection, and definition of the dependent variable, and apply the UMBKER algorithm to an actual credit card scoring model. We undersample the majority samples, delete the redundant samples, and reduce the imbalance ratio of the dataset. The empirical results show that the logistic regression model trained on the dataset after several rounds of UMBKER undersampling performs better on F1 value, recall, and AUC. Compared with the earlier experimental results, UMBKER undersampling is found to be more suitable for real data with a large sample size. Finally, the undersampled data is used to train the model, and the final credit card scoring model is obtained. In summary, the UMBKER undersampling method can improve the performance of credit card scoring.
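The redundancy-elimination step described in the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the thesis's implementation. The abstract does not give the exact formula, the distance used for mixed data, or the stopping condition, so the sketch assumes purely numeric features with Euclidean distance, a coefficient equal to the pairwise distance times the mean of the two samples' center distances, a cluster center that stays fixed during deletion, and a stop once a fraction `keep_fraction` of each cluster remains (0.5, following the "about half per execution" remark). Cluster labels are taken as an input; they would come from KAMILA clustering of the majority class, which is not reimplemented here. The names `remove_redundant` and `undersample_majority` are hypothetical.

```python
import numpy as np


def remove_redundant(X, keep_fraction=0.5, rng=None):
    """Return indices of the samples kept from one cluster.

    Assumed coefficient for a pair (i, j):
        d(i, j) * (d(i, center) + d(j, center)) / 2
    This is one possible reading of "multiply the two values".
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_keep = max(1, int(np.ceil(keep_fraction * n)))  # assumed stop condition

    center = X.mean(axis=0)
    d_center = np.linalg.norm(X - center, axis=1)
    d_pair = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    coef = d_pair * 0.5 * (d_center[:, None] + d_center[None, :])
    np.fill_diagonal(coef, np.inf)  # a sample is never paired with itself

    alive = np.ones(n, dtype=bool)
    while alive.sum() > n_keep:
        # Most redundant pair = smallest coefficient among remaining samples.
        i, j = np.unravel_index(np.argmin(coef), coef.shape)
        drop = rng.choice([i, j])  # delete one of the two at random
        alive[drop] = False
        # "Update" the matrix by masking the deleted sample's row and column.
        coef[drop, :] = np.inf
        coef[:, drop] = np.inf
    return np.flatnonzero(alive)


def undersample_majority(X_major, cluster_labels, keep_fraction=0.5, seed=0):
    """Apply remove_redundant inside each cluster and merge the survivors."""
    rng = np.random.default_rng(seed)
    X_major = np.asarray(X_major, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    kept = []
    for c in np.unique(cluster_labels):
        idx = np.flatnonzero(cluster_labels == c)
        kept.append(idx[remove_redundant(X_major[idx], keep_fraction, rng)])
    return np.sort(np.concatenate(kept))
```

Calling `undersample_majority(X_major, labels)` returns the row indices of the majority samples to keep; concatenating those rows with the untouched minority class gives the undersampled dataset. Running it again on the result corresponds to executing the method several times on a highly imbalanced dataset.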
Keywords: Imbalanced Mixed Data, Credit Card Scoring, Undersampling, KAMILA Clustering, Remove Redundancy, Logistic Regression