Research Of Imbalanced Data Classification Based On The Minority Samples Recombination

Posted on: 2017-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2428330488479870
Subject: Software engineering
Abstract/Summary:
Imbalanced data sets are common in real-world applications. In such data sets, the rare minority class is often the one of greatest interest. Minority-class samples are usually surrounded by a large number of majority-class samples. Because the class sizes are severely skewed and the samples are unevenly distributed, traditional classification algorithms become less applicable: overall classification accuracy can reach a very high level while the intended classification goal remains far from achieved.

Sampling is one of the important research directions in the classification of imbalanced data sets, and SMOTE is a particularly representative sampling method. SMOTE synthesizes new minority-class samples by interpolating between existing minority-class samples in order to balance the classes. However, SMOTE still has problems when generating new samples: the new samples may turn out to be noise, and most of them fall in regions where the original minority samples are already dense, so the sparse distribution of the minority class is not changed.

Therefore, to further improve the classification accuracy of the minority class, this thesis proposes a new sampling method based on BSMOTE, called Distance Based SMOTE (DBSMOTE). When generating new samples, DBSMOTE does not select majority-class samples as sources for synthesis. It uses only intermediate points between selected boundary samples and their nearest samples as the source of new samples. This both broadens the classification boundary of the minority class and reduces the possibility of generating noise. In addition, the distance between the two points used for synthesis is taken into account: the farther apart two points are, the more samples are generated between them, and the closer they are, the fewer. The classifier can therefore pay more attention to boundary samples in sparse regions of the minority class, which offsets the imbalance in the internal distribution of the minority class.

Experimental results on multiple imbalanced data sets show that the proposed algorithm effectively addresses the problems of SMOTE and BSMOTE and improves the classification accuracy of minority-class samples. Data sets processed by DBSMOTE achieve the best overall performance on the evaluation criteria F-measure and AUC, and yield good classification results.
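The distance-weighted interpolation idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's DBSMOTE: it pairs every minority sample with its nearest minority neighbour (the boundary-sample selection step is omitted), uses Euclidean distance, and allocates synthetic samples strictly in proportion to the distance of each pair. The function name `distance_weighted_oversample` and its parameters are hypothetical.

```python
import numpy as np


def distance_weighted_oversample(minority, n_new, seed=0):
    """Hypothetical sketch: SMOTE-style interpolation in which pairs that
    are farther apart receive more synthetic samples.

    Assumptions (not taken from the thesis): nearest-neighbour pairing
    within the minority class, Euclidean distance, proportional allocation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    if len(X) < 2:
        raise ValueError("need at least two minority samples")

    # Pairwise distances; mask the diagonal so a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    pair_dist = dist[np.arange(len(X)), nearest]
    if pair_dist.sum() == 0:
        raise ValueError("all minority samples are identical")

    # Split the budget in proportion to pair distance (largest-remainder rounding).
    share = pair_dist / pair_dist.sum() * n_new
    counts = np.floor(share).astype(int)
    leftover = max(int(n_new - counts.sum()), 0)
    order = np.argsort(-(share - counts), kind="stable")
    counts[order[:leftover]] += 1

    # Interpolate: new = x_i + gap * (x_neighbour - x_i), with gap drawn from [0, 1).
    synthetic = [
        X[i] + rng.random((k, 1)) * (X[nearest[i]] - X[i])
        for i, k in enumerate(counts)
    ]
    return np.vstack(synthetic), counts
```

For three minority points at x = 0, 1 and 5 on a line, the pair distances are 1, 1 and 4, so a budget of six synthetic samples is split 1, 1 and 4: the isolated point in the sparse region receives most of the new samples, which is the effect the abstract attributes to the distance parameter.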
Keywords/Search Tags:Unbalanced data, Classification, SMOTE, Over-sampling