Research Of Imbalanced Data Over-sampling Technique Based On Rough Set Theory

Posted on: 2015-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
GTID: 2298330422483071
Subject: Computer software and theory
Abstract/Summary:
The imbalanced dataset problem is an important research topic in machine learning. In an imbalanced dataset, the class distribution is skewed. Because there are few minority class samples, traditional machine learning methods applied to an imbalanced dataset produce results biased toward the majority class, so accuracy on the minority class is poor and minority class samples are difficult to classify. To improve minority class accuracy, researchers have proposed methods at different levels. Among them, methods that improve minority class accuracy by changing the degree of imbalance are widely used. SMOTE is the most representative over-sampling method. However, the SMOTE algorithm over-samples all minority class samples without distinction. Although it can increase minority class accuracy through broad over-sampling, it decreases majority class accuracy, because the synthetic samples overlap the decision space of the majority class. Therefore, it is necessary to filter the minority class samples that need to be over-sampled and to study a more targeted over-sampling method.

The neighborhood rough set model was proposed by applying rough set theory to neighborhood systems. Because the model is built on samples and their radii, it can easily capture the distribution of the whole imbalanced dataset. Applying it to the SMOTE algorithm could yield a better-performing over-sampling method. In this thesis, we study an over-sampling method based on the neighborhood rough set model. First, we use the model to partition the imbalanced dataset by calculating the radius and neighborhood of each sample, which identifies two groups: the minority class samples that belong to the boundary region and the majority class samples that belong to the positive region. After the partition, we use the SMOTE algorithm to over-sample the minority class samples in the boundary region, and we compare each synthetic sample with the majority class samples in the positive region. If a synthetic sample falls in the neighborhood of some positive-region majority class sample, it is discarded; otherwise, it is added to the training set. We call the resulting algorithm NRSBoundary-SMOTE.

Next, when we apply NRSBoundary-SMOTE to large datasets, the running time is too long and efficiency is low. We therefore use the MapReduce programming paradigm to propose the Parallel-NRSBoundary-SMOTE algorithm, which parallelizes the partition and over-sampling procedures, decreasing the time complexity and improving efficiency on large datasets.

Finally, we run experiments on each algorithm and analyze the results by comparing them with other algorithms. The results show that the over-sampling algorithm in this thesis produces better synthetic samples. It can also process large datasets and reduce running time.
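
The partition-then-filter procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not give the radius formula, so the rule used here (nearest distance plus a fraction `w` of the distance spread) is an assumption, as are the function names `neighborhood_partition` and `nrs_boundary_smote` and the parameters `k`, `w`, `seed`, and `max_tries`.

```python
import math
import random


def neighborhood_partition(X, y, w=0.1):
    """Return (radius, positive): a per-sample radius and a positive-region flag.

    Assumed radius rule (not stated in the abstract):
    radius_i = d_min + w * (d_max - d_min) over distances to all other samples.
    """
    radius, positive = [], []
    for i, xi in enumerate(X):
        dists = [math.dist(xi, xj) for xj in X]
        others = [d for j, d in enumerate(dists) if j != i]
        r = min(others) + w * (max(others) - min(others))
        radius.append(r)
        # Positive region: every sample inside the neighborhood shares xi's class.
        # Otherwise the sample lies in the boundary region.
        positive.append(all(y[j] == y[i] for j, d in enumerate(dists) if d <= r))
    return radius, positive


def nrs_boundary_smote(X, y, minority, n_new, k=5, w=0.1, seed=0, max_tries=10000):
    """Generate up to n_new synthetic minority samples, filtered by neighborhoods."""
    rng = random.Random(seed)
    radius, positive = neighborhood_partition(X, y, w)
    n = len(X)
    # Seeds: minority samples in the boundary region.
    seeds = [i for i in range(n) if y[i] == minority and not positive[i]]
    # Guards: majority samples in the positive region.
    guards = [i for i in range(n) if y[i] != minority and positive[i]]
    minority_idx = [i for i in range(n) if y[i] == minority]
    synthetic = []
    if not seeds or len(minority_idx) < 2:
        return synthetic
    for _ in range(max_tries):
        if len(synthetic) >= n_new:
            break
        i = rng.choice(seeds)
        # k nearest minority neighbors of the seed, excluding the seed itself.
        nn = sorted((j for j in minority_idx if j != i),
                    key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(nn)
        t = rng.random()
        # SMOTE interpolation between the seed and the chosen neighbor.
        cand = [a + t * (b - a) for a, b in zip(X[i], X[j])]
        # Discard the candidate if it falls inside any guard's neighborhood.
        if all(math.dist(cand, X[g]) > radius[g] for g in guards):
            synthetic.append(cand)
    return synthetic
```

Calling `nrs_boundary_smote(X, y, minority=1, n_new=10)` on lists of feature vectors and labels returns at most ten synthetic samples. Fewer are returned when no minority sample lies in the boundary region or when most candidates are rejected. Because each sample's neighborhood is computed independently of the others, the partition step is the kind of work that could be split across map tasks, which is consistent with the MapReduce parallelization the abstract mentions.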
Keywords/Search Tags: Imbalanced Dataset, SMOTE, Neighborhood Rough Set Model, Boundary, Parallel, MapReduce