
Research On Over Sampling Algorithm Oriented To Subdivision Of Minority Class Samples In Imbalanced Data Set

Posted on: 2017-08-12    Degree: Master    Type: Thesis
Country: China    Candidate: Y Yang    Full Text: PDF
GTID: 2348330509953999    Subject: Computer software and theory

Abstract/Summary:
Classification is an important research direction in data mining. Traditional classification methods achieve good results on data sets with a balanced class distribution. In practice, however, data sets are often imbalanced: one class contains significantly more samples than the other. In an imbalanced data set, minority class samples are distributed differently with respect to the decision boundary; the closer a sample lies to the boundary, the more easily it is misclassified, and the more valuable it is to the classifier. We therefore propose an oversampling algorithm oriented to the subdivision of minority class samples: the minority class is divided into three subdivisions according to the distribution of its samples, each subdivision is handled with a different method, and the data set is balanced in a reasonable way. In this paper, we study the classical oversampling algorithms, analyse and summarise their advantages and disadvantages, and propose the improved algorithm.

1. Minority class samples are distributed differently with respect to the decision boundary, and traditional oversampling algorithms either ignore this difference or process only part of the samples. We divide the minority class samples into three subdivisions, named DANGER, AL_SAFE, and SAFE, and handle each subdivision with a different method, so that all minority class samples are used rationally.

2. For each minority class sample, the more minority class samples there are among its k nearest neighbours, the higher its degree of support and the lower its selection probability. Samples in AL_SAFE lie close to the decision boundary, and they are not few in number. We use roulette wheel selection to reduce the amount of sampling in regions that already contain many minority class samples.
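The subdivision into DANGER, AL_SAFE, and SAFE and the roulette weighting described above can be sketched as follows. The abstract does not give the thesis's neighbour-count thresholds or its exact support-to-probability mapping, so the thresholds and the inverse-support weighting below are illustrative assumptions:

```python
import numpy as np

def subdivide_and_weight(X_min, X_maj, k=5):
    """Assign each minority sample to DANGER / AL_SAFE / SAFE by the
    number of majority samples among its k nearest neighbours, and
    compute roulette wheel probabilities that fall as minority support
    (minority neighbours) rises.  Thresholds are illustrative only."""
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    groups, weights = [], []
    for x in X_min:
        dist = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(dist)[1:k + 1]        # drop the sample itself
        maj = int(is_maj[nn].sum())           # majority neighbours among k
        if maj >= k - 1:
            groups.append("DANGER")           # almost surrounded by majority
        elif maj >= k // 2:
            groups.append("AL_SAFE")          # near the decision boundary
        else:
            groups.append("SAFE")             # deep inside the minority region
        # higher minority support -> lower selection probability
        weights.append(1.0 / (1 + (k - maj)))
    w = np.array(weights)
    return groups, w / w.sum()                # normalised roulette probabilities
```

Sampling seeds with `np.random.choice(len(X_min), p=probs)` then favours sparse minority regions over dense ones, which is the effect the roulette wheel step aims for.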
The new samples are thus well distributed. We call this algorithm SD-ISMOTE.

3. Oversampling at the subdivision level alone is coarse-grained: the distribution of samples inside each subdivision is itself uneven. To handle this, after forming the three subdivisions we run the k-means algorithm within each subdivision to cluster its samples, move the oversampling operation down to the cluster level, and then determine the number of samples to generate in each cluster of each subdivision with a rational sampling-quantity calculation method. In this way the internal distribution of each subdivision becomes balanced.

4. When processing the samples in AL_SAFE, the former method samples only inside an n-dimensional ball, so the distribution range of the new samples cannot move closer to the decision boundary. We therefore enlarge the random sampling factor so that the new samples spread closer to the decision boundary. We call this variant SD-ISMOTE2.

We obtain data sets from the UCI repository that are commonly used in classification research and run experiments on them; the results show that both the SD-ISMOTE and SD-ISMOTE2 algorithms achieve clear improvements.
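The cluster-level allocation step can be sketched as follows. The thesis's "rational sampling quantity calculation method" is not spelled out in the abstract, so the inverse-size weighting here is an illustrative stand-in with the same intent: sparse clusters receive proportionally more synthetic samples so the subdivision's internal distribution evens out.

```python
import numpy as np

def cluster_sampling_counts(cluster_sizes, total_new):
    """Allocate an oversampling quota across the k-means clusters of one
    subdivision.  Smaller clusters get larger shares (inverse-size
    weighting, an assumption); rounding error is given to the largest
    share so the counts sum exactly to total_new."""
    inv = 1.0 / np.asarray(cluster_sizes, dtype=float)
    weights = inv / inv.sum()
    counts = np.floor(weights * total_new).astype(int)
    counts[np.argmax(weights)] += total_new - counts.sum()  # fix rounding
    return counts
```

For example, clusters of sizes 10, 5, and 1 sharing a quota of 16 new samples receive 1, 2, and 13 samples respectively, so the sparsest cluster is filled in the most.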
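The enlarged random factor of SD-ISMOTE2 can be illustrated with a SMOTE-style interpolation. With a factor of 1.0 the synthetic sample lies on the segment between a seed and its minority neighbour (the n-dimensional ball of the former method); a factor above 1.0 widens the random range so new samples can land beyond the neighbour, closer to the decision boundary. The factor value is a free parameter here; the thesis's exact setting is not given in the abstract:

```python
import numpy as np

def smote_interpolate(x, neighbor, factor=1.0, rng=None):
    """SMOTE-style synthetic sample: x + gap * (neighbor - x) with
    gap drawn uniformly from [0, factor].  factor > 1 enlarges the
    sampling range beyond the neighbour (the SD-ISMOTE2 idea)."""
    rng = np.random.default_rng() if rng is None else rng
    gap = rng.uniform(0.0, factor)
    return x + gap * (neighbor - x)
```

Applying this only to AL_SAFE seeds, as the abstract describes, pushes the synthetic samples toward the boundary region where misclassification risk is highest.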
Keywords/Search Tags: imbalanced data set, decision boundary, classification, subdivision of minority samples, roulette wheel selection