The Study On Random-SMOTE For The Classification Of Imbalanced Data Sets

Posted on: 2010-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Dong
Full Text: PDF
GTID: 2178360302460387
Subject: Information management and e-government
Abstract/Summary:
Imbalanced data sets are very common in daily life, and the minority class is usually the class of interest. In such data sets, minority-class examples are sparsely distributed and are often surrounded by a large number of majority-class examples, which makes learning the minority class very difficult. Traditional classification algorithms perform poorly on imbalanced data sets: they tend to misclassify minority-class examples and so fail to achieve the objective of classification.

SMOTE is an over-sampling method that generates synthetic examples for the minority class. After SMOTE, however, dense regions of the sample space remain dense and sparse regions remain sparse. SMOTE therefore does not predict well for unknown examples that fall in sparse areas of the sample space, which leaves room for improvement. Inspired by SMOTE, this thesis proposes a new over-sampling method, Random-SMOTE. With Random-SMOTE, new synthetic examples are generated randomly within the sample space of the minority class, which effectively reduces sparseness. Random-SMOTE can handle both numerical and non-numerical attributes.

Based on Random-SMOTE, a classification model for imbalanced data sets is proposed as a comprehensive solution to the problem. The model combines the Random-SMOTE sampling technique with the k-nearest neighbor classification algorithm. To handle data sets with heterogeneous attributes, the HEOM metric is used in the k-nearest neighbor algorithm. Data preparation and the selection of performance evaluation metrics are also part of the model.

A series of experiments on many real data sets shows that Random-SMOTE handles class imbalance problems well. Compared with other sampling techniques, such as SMOTE, random over-sampling and random under-sampling, Random-SMOTE not only predicts the minority class better but is also less sensitive to the absolute rarity of the minority class, and it generally performs best as measured by G-mean. A suggested setting is also given for the over-sampling rate N, the only tunable parameter of Random-SMOTE.

Because Random-SMOTE can handle both numerical and non-numerical data sets and is less sensitive to the absolute rarity of the minority class, it is very robust and can be applied to many real-life problems.
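The abstract names three building blocks without spelling them out: SMOTE-style interpolation, the HEOM distance, and the G-mean metric. The Python sketch below illustrates their standard textbook definitions. It is not the thesis's implementation, and it does not reproduce the Random-SMOTE generation step, which the abstract does not specify. The function names (`interpolate`, `heom`, `g_mean`) and the `numeric_ranges` argument are hypothetical, chosen only for this illustration.

```python
import math
import random


def interpolate(x, y, rng=random):
    """SMOTE-style synthetic point on the segment between numeric vectors x and y.

    Assumption: this is the classic interpolation step only; how the two
    endpoints are chosen differs between SMOTE and Random-SMOTE.
    """
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (yi - xi) for xi, yi in zip(x, y)]


def heom(a, b, numeric_ranges):
    """Heterogeneous Euclidean-Overlap Metric between two mixed-type records.

    numeric_ranges[i] is (max - min) of attribute i for a numeric attribute,
    or None for a nominal attribute. Missing values are passed as None.
    """
    total = 0.0
    for ai, bi, attr_range in zip(a, b, numeric_ranges):
        if ai is None or bi is None:
            d = 1.0  # a missing value counts as maximally distant
        elif attr_range is None:
            d = 0.0 if ai == bi else 1.0  # overlap distance for nominal values
        else:
            # range-normalised difference for numeric values
            d = abs(ai - bi) / attr_range if attr_range > 0 else 0.0
        total += d * d
    return math.sqrt(total)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of minority-class recall and majority-class recall."""
    if tp + fn == 0 or tn + fp == 0:
        return 0.0  # assumption: treat an empty class as a score of zero
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # One numeric attribute with range 4.0 and one nominal attribute.
    print(heom((1.0, "red"), (3.0, "blue"), (4.0, None)))
    print(g_mean(tp=8, fn=2, tn=90, fp=10))
    print(interpolate([0.0, 0.0], [2.0, 4.0]))
```

G-mean rewards a classifier only when it does well on both classes. For example, a model that labels every example as majority class has a sensitivity of zero and therefore a G-mean of zero, however high its overall accuracy is.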
Keywords/Search Tags: Imbalanced data sets, Classification, SMOTE, Random-SMOTE