Research Of Imbalanced Data Over-sampling Technique Based On Rough Set Theory

Posted on:2015-06-29

Degree:Master

Type:Thesis

Country:China

Candidate:H Li

Full Text:PDF

GTID:2298330422483071

Subject:Computer software and theory

Abstract/Summary:

Learning from imbalanced datasets is an important research problem in machine learning. In an imbalanced dataset, the class distribution is skewed and the minority class has few samples. When traditional machine learning methods are used to classify such a dataset, the results are biased toward the majority class, so accuracy on the minority class is poor. To improve minority class accuracy, researchers have proposed methods at different levels. Among them, methods that change the degree of imbalance are widely used, and SMOTE is the most representative over-sampling method. However, SMOTE over-samples all minority class samples without distinction. Although this broad over-sampling can increase minority class accuracy, it decreases majority class accuracy, because the synthetic samples overlap the decision space of the majority class. It is therefore necessary to filter the minority class samples that need to be over-sampled and to study a more targeted over-sampling method.

The neighborhood rough set model was proposed by applying rough set theory to neighborhood systems. Because the model is built on samples and their neighborhood radii, it can readily capture the distribution of the whole imbalanced dataset. Applying it to SMOTE can yield a better-performing over-sampling method.

In this thesis, we study over-sampling based on the neighborhood rough set model. First, we use the model to partition the imbalanced dataset by computing the radius and neighborhood of each sample. This identifies two groups: minority class samples that belong to the boundary region, and majority class samples that belong to the positive region. After partitioning, we use SMOTE to over-sample the minority class samples in the boundary region, and compare each synthetic sample with the majority class samples in the positive region. If a synthetic sample falls within the neighborhood of any majority class sample in the positive region, it is discarded; otherwise it is added to the training set. We call the resulting algorithm NRSBoundary-SMOTE.

When NRSBoundary-SMOTE is applied to large datasets, its running time is too long and its efficiency is low. We therefore use the MapReduce programming paradigm to propose the Parallel-NRSBoundary-SMOTE algorithm, which parallelizes the partitioning and over-sampling procedures, reducing the time complexity and improving efficiency on large datasets.

Finally, we run experiments on each algorithm and analyze the results by comparing them with other algorithms. The results show that the over-sampling algorithms in this thesis produce better synthetic samples and can also process large datasets with reduced running time.
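The partition-then-filter procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes Euclidean distance and a single fixed radius for every sample (the abstract says a radius is calculated per sample but does not give the formula), it treats a sample as "boundary" when its neighborhood contains another class and as "positive region" otherwise, and the names `pairwise_dist`, `partition`, `nrs_boundary_smote`, `n_per_seed` and `k` are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def partition(X, y, radius, minority=1):
    """Return indices of (boundary minority, positive-region majority) samples.

    Assumption: one fixed radius for all samples.
    """
    within = pairwise_dist(X, X) <= radius      # neighborhood membership (includes self)
    differs = y[:, None] != y[None, :]          # True where two labels disagree
    mixed = (within & differs).any(axis=1)      # neighborhood contains another class
    boundary_min = np.flatnonzero(mixed & (y == minority))
    positive_maj = np.flatnonzero(~mixed & (y != minority))
    return boundary_min, positive_maj

def nrs_boundary_smote(X, y, radius, n_per_seed=2, k=3, minority=1, seed=0):
    """SMOTE-style interpolation from boundary minority seeds, with filtering."""
    rng = np.random.default_rng(seed)
    boundary_min, positive_maj = partition(X, y, radius, minority)
    X_min = X[y == minority]
    kept = []
    for i in boundary_min:
        d = np.linalg.norm(X_min - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]     # skip the seed itself (distance 0)
        if neighbours.size == 0:
            continue                            # no other minority sample to pair with
        for _ in range(n_per_seed):
            mate = X_min[rng.choice(neighbours)]
            candidate = X[i] + rng.random() * (mate - X[i])
            # Discard candidates inside the neighborhood of a positive-region
            # majority sample; keep the rest for the training set.
            if positive_maj.size and (
                np.linalg.norm(X[positive_maj] - candidate, axis=1) <= radius
            ).any():
                continue
            kept.append(candidate)
    return np.array(kept).reshape(-1, X.shape[1])

# Toy data: label 0 is the majority class, label 1 the minority class.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6],
              [1.5, 0.5], [2.5, 0.5], [3.5, 0.5]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
synthetic = nrs_boundary_smote(X, y, radius=1.0, n_per_seed=4, k=2)
print(synthetic.shape)
```

In this sketch each seed is processed independently of the others. That kind of independence is what a map-style split would rely on, although the abstract does not describe how the thesis's MapReduce version is actually designed.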

Keywords/Search Tags:

Imbalanced Dataset, SMOTE, Neighborhood Rough Set Model, Boundary, Parallel, MapReduce

Related items

1	Research And Application Of Imbalanced Dataset Classification Prediction Algorithm
2	Adaptive Classification Boundary And Double Thresholds Supervised Neighborhood Rough Set And Its Attribute Reduction
3	Granular Computing-oriented Dynamic Neighborhood Imbalanced Data Classification Algorithm
4	A Research For Imbalanced Data Classification Algorithm Based On Neighborhood Rough Set And Hypernetwork
5	Research On Uncertainty Measurement Method For Neighborhood Rough Set Model
6	Classification Learning Of Imbalanced Data Sets Based On Sampling Processing
7	Research On Model Extension And Algorithm Based On Neighborhood Rough Set
8	Research And Application Of Imbalanced Data Based On Support Vector Machine
9	Suboptimal Decision Table Reduction Algorithm Based On Neighborhood Rough Model
10	The Model Of ?-? Neighborhood Rough Sets And Its Applications