Application Of Clustering Based Sampling Algorithms In Unbalanced Data Learning

Posted on: 2020-12-27  Degree: Master  Type: Thesis
Country: China  Candidate: C X Li  Full Text: PDF
GTID: 2428330578461336  Subject: Computational Mathematics
Abstract/Summary:
With the rapid development of data science, the classification of imbalanced data sets has become one of the important issues in data analysis. It arises in a variety of practical problems, such as credit card fraud detection and network attack identification. Fully balanced data sets are rare in practice, and the cost of misclassifying minority samples is usually higher than the cost of misclassifying majority samples. Improving sampling algorithms so that an imbalanced data set can be converted into a balanced one is therefore of great practical significance. Researchers have approached the classification of imbalanced data sets from two directions: the classification algorithm and the data set itself. The main contents of this thesis are as follows:

(1) The thesis describes the research background and significance of imbalanced data set classification and the state of research at home and abroad, with particular attention to the K-means algorithm, the SMOTE algorithm, k-Nearest Neighbor, and the Support Vector Machine.

(2) To improve the effectiveness of under-sampling, we designed USCL (an under-sampling method based on clustering). The basic idea is to cluster the majority samples in the training set several times, each time with a different number of clusters, and to use the resulting cluster centers to represent the majority class. Each set of cluster centers is then combined with the minority samples to form a new training set. A classifier is trained on each of these training sets, and classifiers with a tendency toward false classification are eliminated. Finally, the remaining classifiers vote on the classification result. The theoretical analysis and experimental results showed that the algorithm can effectively improve classification performance on imbalanced data sets.

(3) After studying the advantages and disadvantages of the SMOTE (Synthetic Minority Oversampling Technique) algorithm, we proposed OVSCL (an over-sampling method based on clustering). The basic idea is to divide the minority samples in the training set into three categories and then apply clustering to the boundary samples. Using different numbers of clusters, the cluster centers are taken to represent the boundary samples. New samples are synthesized according to the specified sampling rate for new samples and the basic principle of SMOTE. The new samples are then combined with the majority samples to form a number of new training sets, a classifier is trained on each training set, and the classifiers vote on the classification result. The theoretical analysis and experimental results showed that the Borderline-SMOTE algorithm, the Refined Borderline-SMOTE algorithm, the OVSCLC (over-sampling based on clustering without choices) algorithm, and the OVSCL algorithm of this thesis all achieved the goal of improving on SMOTE. Among these, the OVSCL method proposed here is especially helpful for improving the classification accuracy of minority samples.
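
The two procedures summarized in items (2) and (3) can be sketched roughly as follows. This is only an illustrative sketch built from the abstract, not the thesis's implementation: all function names are hypothetical, a hand-written Lloyd's k-means and a 1-nearest-neighbor classifier stand in for whatever clustering setup and base learners the thesis used, and the step that eliminates classifiers with a false classification tendency is left out because the abstract does not state its criterion. For the over-sampling part, the seeds are assumed to be cluster centers of the boundary samples, and how boundary samples are identified is likewise assumed to happen elsewhere.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[idx].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

def undersampled_sets(majority, minority, cluster_counts, seed=0):
    """Under-sampling: one training set per cluster count.

    Majority samples are replaced by their cluster centers (label 0)
    and combined with all minority samples (label 1).
    """
    sets = []
    for k in cluster_counts:
        centers = kmeans(majority, k, seed=seed)
        sets.append([(c, 0) for c in centers] + [(m, 1) for m in minority])
    return sets

def one_nn(train):
    """1-nearest-neighbor classifier (an assumed stand-in base learner)."""
    def predict(x):
        return min(train, key=lambda item: sq_dist(x, item[0]))[1]
    return predict

def vote(classifiers, x):
    """Majority vote over binary classifiers; ties go to label 0."""
    ones = sum(clf(x) for clf in classifiers)
    return 1 if 2 * ones > len(classifiers) else 0

def smote_from_seeds(seeds, minority, n_new, k=3, seed=0):
    """Over-sampling: SMOTE-style interpolation starting from seed points.

    Each synthetic sample lies on the segment between a seed and one of
    its k nearest minority samples: new = s + u * (nb - s), u in [0, 1).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        s = rng.choice(seeds)
        others = [m for m in minority if m != s]
        neighbors = sorted(others, key=lambda m: sq_dist(s, m))[:k]
        nb = rng.choice(neighbors)
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(s, nb)))
    return synthetic

if __name__ == "__main__":
    # Toy data: 100 majority points near the origin, 4 minority points far away.
    majority = [(x / 10, y / 10) for x in range(10) for y in range(10)]
    minority = [(5.0, 5.0), (5.5, 5.0), (5.0, 5.5), (5.5, 5.5)]

    classifiers = [one_nn(s) for s in undersampled_sets(majority, minority, [2, 3, 4])]
    print(vote(classifiers, (5.2, 5.1)), vote(classifiers, (0.3, 0.4)))

    seeds = kmeans(minority, 2)  # treating all minority points as boundary samples
    print(smote_from_seeds(seeds, minority, n_new=3))
```

Each entry in `cluster_counts` yields one balanced training set and one classifier, so the size of the voting ensemble is controlled by how many cluster counts are tried. Swapping `one_nn` for another learner only requires a function that maps a labeled training set to a `predict` callable.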
Keywords/Search Tags: imbalanced data set, clustering algorithm, under-sampling, over-sampling, simulation experiment