The Study On Random-SMOTE For The Classification Of Imbalanced Data Sets

Posted on:2010-10-09

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Dong


GTID:2178360302460387

Subject:Information management and e-government

Abstract/Summary:


Imbalanced data sets are very common in practice, and in most of them the minority class is the one we most want to identify. Minority-class examples are sparsely distributed and usually surrounded by a large number of majority-class examples, which makes learning the minority class a great challenge. Traditional classification algorithms perform poorly on imbalanced data sets: they tend to misclassify minority-class examples and so fail to achieve the objective of classification.

SMOTE is an over-sampling method that generates synthetic examples for the minority class. After SMOTE is applied, however, regions of the sample space that were dense remain dense and regions that were sparse remain sparse. SMOTE therefore cannot predict well for unknown examples that fall in sparse regions of the sample space, and there is still room for improvement. Inspired by SMOTE, a new over-sampling method, Random-SMOTE, is proposed in this thesis.

With Random-SMOTE, new synthetic examples are generated randomly in the sample space of the minority class, which effectively reduces its sparseness. Random-SMOTE can handle both numerical and non-numerical attributes. Based on Random-SMOTE, a classification model for imbalanced data sets is proposed as a comprehensive solution to the problem. The model combines the Random-SMOTE sampling technique with the k-nearest neighbor classification algorithm. To handle data sets with heterogeneous attributes, the HEOM metric is used in the k-nearest neighbor algorithm. Data preparation and the selection of performance evaluation metrics are also part of the model.

A series of experiments on many real data sets shows that Random-SMOTE deals well with class imbalance problems. Compared with other sampling techniques, such as SMOTE, random over-sampling and random under-sampling, Random-SMOTE not only predicts the minority class better but is also less sensitive to the absolute rarity of the minority class, and it generally performs best as measured by G-mean. A suggested setting for the only tunable parameter of Random-SMOTE, the over-sampling rate N, is also given.

Because Random-SMOTE can handle both numerical and non-numerical data sets and is less sensitive to the absolute rarity of the minority class, it is very robust and can be applied to many real-life problems.
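
The abstract names three building blocks without giving formulas: the Random-SMOTE generation step, the HEOM distance, and the G-mean metric. The Python sketch below illustrates each under stated assumptions and is not taken from the thesis text. The generation step assumes a triangular interpolation scheme (a temporary point on the segment between two randomly chosen minority examples, then a synthetic point between the original example and that temporary point) and covers numerical attributes only. HEOM and G-mean follow their usual textbook definitions. All function and parameter names (`random_smote`, `heom`, `g_mean`, `ranges`, `positive`) are hypothetical.

```python
import math
import random


def random_smote(minority, n, rng=None):
    """Return n synthetic points per minority example (numeric attributes only).

    ASSUMPTION: triangular interpolation; the abstract does not specify this.
    Requires at least two minority examples.
    """
    rng = rng or random.Random()
    synthetic = []
    for x in minority:
        # Two minority examples chosen at random span a segment y1-y2.
        y1, y2 = rng.sample(minority, 2)
        for _ in range(n):
            a, b = rng.random(), rng.random()
            # Temporary point on the segment y1-y2.
            t = [p + a * (q - p) for p, q in zip(y1, y2)]
            # Synthetic point on the segment x-t.
            synthetic.append([xi + b * (ti - xi) for xi, ti in zip(x, t)])
    return synthetic


def heom(a, b, ranges):
    """Heterogeneous Euclidean-Overlap Metric between two examples.

    ranges[i] is (max - min) of numeric attribute i, or None when
    attribute i is nominal. A missing value is represented by None.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            d = 1.0                         # missing value: maximal distance
        elif r is None:
            d = 0.0 if x == y else 1.0      # nominal attribute: overlap
        else:
            d = abs(x - y) / r if r else 0.0  # numeric: range-normalized
        total += d * d
    return math.sqrt(total)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the recall on the positive and negative classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)
```

In a pipeline like the one the abstract describes, `random_smote` would enlarge the minority class before training, `heom` would serve as the distance function of the k-nearest neighbor classifier, and `g_mean` would score the predictions. G-mean is a natural choice here because it drops to zero whenever either class is entirely misclassified, which plain accuracy does not reveal on imbalanced data.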

Keywords/Search Tags:

Imbalanced data sets, Classification, SMOTE, Random-SMOTE


Related items

1	The Study On Random-SMOTE For The Classification Of Imbalanced Data Sets
2	Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm
3	Research On The Classification Of Imbalanced Data Sets Based On R-SMOTE
4	Research On The Classification Of Imbalanced Data Sets And Related Problems
5	Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets
6	The Research Of Web Pages Filtering Based On Random Forests Algorithms
7	Research And Application Of Classification Technology For Unbalanced Data
8	Research On The Method Of Solving Imbalanced Classification Problems Based On Random Forest Algorithm
9	Research And Application Of Imbalanced Data Classification
10	An Imbalanced Data Classification Based On Improved SVM