
Classification Learning Of Imbalanced Data Sets Based On Sampling Processing

Posted on: 2018-12-11    Degree: Master    Type: Thesis
Country: China    Candidate: X Shang    Full Text: PDF
GTID: 2428330548999954    Subject: Computational Mathematics
Abstract/Summary:
In the information age, data classification is an important research topic, and the classification of imbalanced data sets is particularly challenging. Imbalanced data sets are common in real life: the minority class has few samples, is sparsely distributed, and is surrounded by a large number of majority class samples, which makes it hard to classify. In practical applications, misclassifying a minority class sample often carries a higher cost, so improving classification performance on the minority class is important and deserves more attention.

At the level of data processing, over-sampling algorithms balance the class sizes by synthesizing new minority class samples. Random over-sampling increases the number of minority class samples by simple replication; it improves minority-class performance to some extent, but leads to sample overlap and over-fitting. In 2002, Chawla et al. proposed SMOTE (Synthetic Minority Over-sampling Technique). Its basic idea is: for each minority class sample, find its k nearest minority class neighbors, randomly select several of them according to the sampling rate, and synthesize new minority class samples by linear interpolation between the sample and each selected neighbor. This alleviates the overlap and over-fitting problems. However, SMOTE synthesizes new samples from all minority class samples and ignores the effect of boundary samples on classification performance. To address this, Han et al. proposed the Borderline-SMOTE algorithm, whose basic idea is to synthesize new samples only from the minority class samples lying on the class boundary, which improves the classification accuracy of the minority class. But this method selects boundary samples with the k-nearest-neighbor rule, and different choices of k lead to different boundary samples, so it has certain limitations.

This paper proposes a new method for selecting boundary samples, the DBSMOTE algorithm, together with a new rule for synthesizing minority class samples. The basic idea of DBSMOTE is: first, compute the distance between each minority class sample and the majority class samples, as well as the average of these distances; second, select as boundary samples the minority class samples whose distance is less than the average; third, synthesize new minority class samples from the boundary samples using a random rule; finally, merge the synthesized samples with the original samples into a new sample set and model the data with the k-nearest-neighbor classification algorithm. Experimental results show that the algorithm effectively improves the classification performance of the minority class.

When a data set lacks samples, both over-sampling and under-sampling have drawbacks: over-sampling can over-fit the data set, while under-sampling discards much sample information. Combining the two methods can mitigate both problems, and earlier studies of such combined sampling have reported good results. In this paper, we combine the over-sampling Random-SMOTE algorithm with an under-sampling algorithm. Experimental results show that the combined algorithm effectively improves the classification performance of the minority class.
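The SMOTE interpolation step described above can be sketched as follows. This is a minimal illustration of the technique, not the authors' implementation; the function and parameter names are assumptions.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Synthesize n_new minority samples by SMOTE-style linear interpolation.

    X_min : (n, d) array of minority class samples.
    For each synthetic sample: pick a minority sample x, pick one of its
    k nearest minority neighbors x_nn, and return x + u * (x_nn - x)
    with u drawn uniformly from [0, 1].
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise squared distances among minority samples (small-data sketch).
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    # Indices of the k nearest neighbors of each sample, excluding itself.
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    new = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                # random minority sample
        x = X_min[j]
        x_nn = X_min[rng.choice(nn[j])]    # random one of its k neighbors
        u = rng.random()                   # interpolation coefficient in [0, 1]
        new[i] = x + u * (x_nn - x)        # linear interpolation
    return new
```

Because each synthetic point lies on the segment between a minority sample and one of its neighbors, the new samples stay inside the minority region instead of duplicating existing points, which is why SMOTE reduces the overlap and over-fitting of random over-sampling.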
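The boundary-selection step of the proposed DBSMOTE algorithm could be sketched as below. This assumes that the "distance between each minority class sample and the majority class samples" means the distance to the nearest majority class sample; that reading, and all names, are assumptions rather than the thesis's exact implementation.

```python
import numpy as np

def select_boundary(X_min, X_maj):
    """Distance-based boundary selection as described in the abstract.

    Assumption: a minority sample's distance to the majority class is the
    distance to its nearest majority class sample. Minority samples closer
    to the majority class than the average such distance are treated as
    boundary samples.
    """
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    # Distance from every minority sample to its nearest majority sample.
    d = np.sqrt(((X_min[:, None, :] - X_maj[None, :, :]) ** 2).sum(-1)).min(axis=1)
    avg = d.mean()
    # Samples nearer to the majority class than average lie near the boundary.
    return X_min[d < avg]
```

Synthetic minority samples would then be generated only from the returned boundary set, after which the augmented data set is classified with k-nearest neighbors, as the abstract describes.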
Keywords/Search Tags: Imbalanced data sets, SMOTE algorithm, Random-SMOTE algorithm, Borderline-SMOTE algorithm, combination algorithm