Application Of Clustering Based Sampling Algorithms In Unbalanced Data Learning

Posted on:2020-12-27

Degree:Master

Type:Thesis

Country:China

Candidate:C X Li

Full Text:PDF

GTID:2428330578461336

Subject:Computational Mathematics

Abstract/Summary:

With the rapid development of data science, the classification of imbalanced data sets has become an important issue in data analysis. It arises in many practical problems, such as credit card fraud detection and network attack identification. Fully balanced data sets are rare in practice, and misclassifying a minority-class sample usually costs more than misclassifying a majority-class sample. Improving sampling algorithms so that an imbalanced data set can be turned into a balanced one is therefore of great practical significance. Researchers have approached the classification of imbalanced data sets from two directions: the classification algorithm and the data set itself. The main contents of this thesis are as follows:

(1) The thesis describes the research background and significance of imbalanced data set classification and the state of research in China and abroad, with particular attention to the K-means algorithm, the SMOTE algorithm, k-Nearest Neighbor and Support Vector Machine.

(2) To improve the effectiveness of under-sampling, we designed USCL, an under-sampling method based on clustering. The basic idea is to cluster the majority samples in the training set several times, each time with a different number of clusters. The cluster centers are then used to represent the majority class and are combined with the minority samples to form several new training sets. A classifier is trained on each training set, and classifiers that show a tendency toward misclassification are eliminated. Finally, the remaining classifiers vote on the classification result. Theoretical analysis and experimental results showed that the algorithm effectively improves classification performance on imbalanced data sets.

(3) After studying the advantages and disadvantages of the SMOTE (Synthetic Minority Oversampling Technique) algorithm, we proposed OVSCL, an over-sampling method based on clustering. The basic idea is as follows: the minority samples in the training set are divided into three categories, and the boundary samples are clustered. Using different numbers of clusters, the cluster centers are taken to represent the boundary samples. New samples are then synthesized according to the preset sampling rate for new samples and the basic principle of SMOTE. Next, the new samples are combined with the majority samples to form several new training sets. A classifier is trained on each training set, and the classifiers vote on the classification result. Theoretical analysis and experimental results showed that the Borderline-SMOTE algorithm, the Refined Borderline-SMOTE algorithm, the OVSCLC algorithm (an over-sampling method based on clustering without choices) and the OVSCL algorithm of this thesis all achieved the goal of improving on SMOTE. The OVSCL method, however, is beneficial for improving the classification accuracy of minority samples.
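The two sampling ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`kmeans_centers`, `undersampled_sets`, `smote_point`), the plain Lloyd's k-means, the iteration count and the 0/1 label encoding are all assumptions made for illustration. Classifier training, the elimination of biased classifiers, the voting step and the three-way split of minority samples are omitted.

```python
import math
import random


def kmeans_centers(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def undersampled_sets(majority, minority, cluster_counts, seed=0):
    """One training set per k: k majority-class centers plus all minority samples."""
    sets = []
    for k in cluster_counts:
        centers = kmeans_centers(majority, k, seed=seed)
        samples = centers + [list(p) for p in minority]
        labels = [0] * k + [1] * len(minority)  # 0 = majority, 1 = minority
        sets.append((samples, labels))
    return sets


def smote_point(x, neighbor, rng):
    """SMOTE-style interpolation: a random point on the segment from x to neighbor."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


# Toy data (made up): 200 majority points near the origin, 10 minority points near (3, 3).
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

for samples, labels in undersampled_sets(majority, minority, [5, 10, 20]):
    print(len(samples), labels.count(0), labels.count(1))

# One synthetic minority sample between two existing minority samples.
print(smote_point(minority[0], minority[1], rng))
```

Each training set produced this way would be passed to a separate classifier, as in the USCL description. In an OVSCL-style variant, `smote_point` would presumably be applied around the cluster centers of the boundary samples rather than around arbitrary minority points.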

Keywords/Search Tags:

imbalanced data set, clustering algorithm, under-sampling, over-sampling, simulation experiment

Related items

1	Research On Imbalanced Dataset Classification Algorithm Based On Sampling
2	Data Distribution-driven Adaptive Hybrid Sampling Method For Imbalanced Data Processing
3	Imbalanced Classification Algorithm Based On Clustering Ensemble Under-Sampling
4	Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application
5	An Imbalanced Data Classification Algorithm Combining Clustering With Sampling Strategy
6	Research On Hybrid Sampling Algorithm Under Denoising In Imbalanced Classification
7	Research On Hybrid Sampling Of Imbalanced Data Based On Data Distribution
8	Research On Imbalanced Data Sampling Methods For Text Sentiment Classification
9	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
10	Research On The Re-sampling Technology Of Data Mining For High-dimensional Imbalanced Dataset