Research Of Imbalanced Data Classification Based On The Minority Samples Recombination

Posted on:2017-05-07

Degree:Master

Type:Thesis

Country:China

Candidate:X Li

GTID:2428330488479870

Subject:Software engineering

Abstract/Summary:

Unbalanced data sets are common in real-world applications. In such data sets, the minority class is rare, yet it is often the class of greatest interest. Minority-class samples are usually surrounded by a large number of majority-class samples. Because the class sizes are severely skewed and the samples are unevenly distributed, traditional classification algorithms are poorly suited to unbalanced data: they can reach very high overall accuracy while still falling far short of the intended classification goal.

Sampling is one of the important research directions in the classification of unbalanced data sets, and SMOTE is a particularly representative sampling technique. SMOTE synthesizes new minority-class samples by interpolating between existing minority-class samples in order to balance the classes. However, SMOTE still has problems when generating new samples: the new samples may become noise, and most of them fall in regions where the original minority samples are already dense, so the sparse distribution of the minority class does not change.

To further improve classification accuracy on the minority class, this paper proposes a new sampling method based on BSMOTE, called Distance Based SMOTE (DBSMOTE). When generating new samples, DBSMOTE does not select majority-class samples as source data. It uses only boundary samples, together with the intermediate points between them and their nearest samples, as the sources for synthesis. This both broadens the classification boundary of the minority class and reduces the chance of generating noise. DBSMOTE also takes the distance between the two source points into account: the farther apart two points are, the more samples are generated between them, and the closer they are, the fewer. As a result, the classification algorithm pays more attention to boundary samples in sparse regions of the minority class, which offsets the uneven distribution within the minority class itself.

Experimental results on multiple unbalanced data sets show that the proposed algorithm effectively addresses the problems of the SMOTE and BSMOTE algorithms and improves classification accuracy on minority-class samples. Data sets processed with DBSMOTE achieved the best overall performance on the F-measure and AUC evaluation criteria, with good classification results.
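
The distance-weighted idea described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation from the thesis. The function names (`boundary_indices`, `distance_weighted_oversample`), the parameters `k`, `n_new`, and `seed`, the use of the Borderline-SMOTE "danger" rule to pick boundary samples, and the random interpolation along the segment to the nearest minority neighbor are all assumptions made for illustration.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def boundary_indices(minority, majority, k=5):
    """Indices of minority points whose k nearest neighbours are at least
    half, but not all, majority points (assumed Borderline-SMOTE rule)."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    result = []
    for i, p in enumerate(minority):
        # Distances to every other point; label 1 marks a majority point.
        others = [
            (distance(p, q), label)
            for j, (q, label) in enumerate(labelled)
            if j != i
        ]
        others.sort()
        n_major = sum(label for _, label in others[:k])
        if k / 2 <= n_major < k:
            result.append(i)
    return result


def distance_weighted_oversample(minority, majority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points (hypothetical helper).

    Each boundary sample is paired with its nearest minority neighbour.
    Pairs are then drawn with probability proportional to their distance,
    so pairs that are farther apart receive more synthetic points.
    """
    rng = random.Random(seed)
    pairs, weights = [], []
    for i in boundary_indices(minority, majority, k):
        p = minority[i]
        # Only minority points are candidates; majority points are never used.
        neighbours = [q for j, q in enumerate(minority) if j != i]
        if not neighbours:
            continue
        q = min(neighbours, key=lambda m: distance(p, m))
        d = distance(p, q)
        if d > 0:
            pairs.append((p, q))
            weights.append(d)
    if not pairs:
        return []
    synthetic = []
    for p, q in rng.choices(pairs, weights=weights, k=n_new):
        t = rng.random()  # assumed: uniform position along the segment p -> q
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
majority = [(1.0, 1.0), (2.0, 0.5), (4.0, 1.0), (4.0, -1.0), (9.0, 9.0), (9.0, 8.0)]
new_points = distance_weighted_oversample(minority, majority, n_new=10, k=3, seed=1)
```

In this toy data, the first two minority points count as boundary samples, while the third is surrounded entirely by majority points and is skipped as likely noise. Because pairs are drawn with probability proportional to their separation, widely spaced boundary pairs receive more synthetic points than tightly packed ones, and no majority-class point is ever used as an interpolation endpoint.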

Keywords/Search Tags:

Unbalanced data, Classification, SMOTE, Over-sampling

Related items

1	Research On Classification Method Of Imbalanced Data Set Based On Improved Sampling Strategy
2	Research On Outlier Detection For Unbalanced Data
3	Research And Application Of Classification Algorithm Based On Unbalanced Data
4	Research Of Imbalanced Data Classification Based On The Minority Samples Recombination
5	The Improvement And Application Of Smote Algorithm For Unbalanced Data Sampling
6	Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm
7	Unbalanced Data Sampling Based On Sample Prior Distribution Information
8	Research On SVM Classification Of Unbalanced Data And Its Application In Identify Poor Students In Colleges And Universities
9	Research On Employee Turnover Prediction Based On SMOTE-SVM Under Unbalanced Data
10	Improvement And Application Of SMOTE Algorithm