
Research On Over Sampling Algorithm Oriented To Subdivision Of Minority Class Samples In Imbalanced Data Set

Posted on: 2017-08-12    Degree: Master    Type: Thesis
Country: China    Candidate: Y Yang    Full Text: PDF
GTID: 2348330509953999    Subject: Computer software and theory

Abstract/Summary:
Classification is an important research direction in data mining. Traditional classification methods achieve good results on data sets with a balanced class distribution. In practice, however, data sets are often imbalanced: one class contains significantly more samples than the other. In an imbalanced data set, minority class samples are distributed differently with respect to the decision boundary; the closer a sample lies to the boundary, the more easily it is misclassified, and the more valuable it is to the classifier. We therefore propose an oversampling algorithm oriented to the subdivision of minority class samples: the minority class is divided into three subdivisions according to the distribution of its samples, each subdivision is handled with a different method, and the data set is balanced in a reasonable way. In this paper, we study the classical oversampling algorithms, analyse and summarise their advantages and disadvantages, and propose the improved algorithm.

1. Minority class samples are distributed differently with respect to the decision boundary, and traditional oversampling algorithms either ignore this difference or process only part of the samples. We divide the minority class samples into three subdivisions, named DANGER, AL_SAFE, and SAFE, and handle each subdivision with a different method, so that all minority class samples are used rationally.

2. For each minority class sample, the more minority class samples there are among its k nearest neighbours, the higher its degree of support and the lower its selection probability. Samples in AL_SAFE lie close to the decision boundary, and they are not few in number. We use roulette wheel selection to reduce the amount of sampling in regions that already contain many minority class samples.
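The subdivision into DANGER, AL_SAFE, and SAFE and the roulette weighting described above can be sketched as follows. The abstract does not give the thesis's neighbour-count thresholds or its exact support-to-probability mapping, so the thresholds and the inverse-support weighting below are illustrative assumptions:

```python
import numpy as np

def subdivide_and_weight(X_min, X_maj, k=5):
    """Assign each minority sample to DANGER / AL_SAFE / SAFE by the
    number of majority samples among its k nearest neighbours, and
    compute roulette wheel probabilities that fall as minority support
    (minority neighbours) rises.  Thresholds are illustrative only."""
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    groups, weights = [], []
    for x in X_min:
        dist = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(dist)[1:k + 1]        # drop the sample itself
        maj = int(is_maj[nn].sum())           # majority neighbours among k
        if maj >= k - 1:
            groups.append("DANGER")           # almost surrounded by majority
        elif maj >= k // 2:
            groups.append("AL_SAFE")          # near the decision boundary
        else:
            groups.append("SAFE")             # deep inside the minority region
        # higher minority support -> lower selection probability
        weights.append(1.0 / (1 + (k - maj)))
    w = np.array(weights)
    return groups, w / w.sum()                # normalised roulette probabilities
```

Sampling seeds with `np.random.choice(len(X_min), p=probs)` then favours sparse minority regions over dense ones, which is the effect the roulette wheel step aims for.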
The new samples are thus well distributed. We call this algorithm SD-ISMOTE.

3. Oversampling at the subdivision level alone is coarse-grained: the distribution of samples inside each subdivision is itself uneven. To handle this, after forming the three subdivisions we run the k-means algorithm within each subdivision to cluster its samples, move the oversampling operation down to the cluster level, and then determine the number of samples to generate in each cluster of each subdivision with a rational sampling-quantity calculation method. In this way the internal distribution of each subdivision becomes balanced.

4. When processing the samples in AL_SAFE, the former method samples only inside an n-dimensional ball, so the distribution range of the new samples cannot move closer to the decision boundary. We therefore enlarge the random sampling factor so that the new samples spread closer to the decision boundary. We call this variant SD-ISMOTE2.

We obtain data sets from the UCI repository that are commonly used in classification research and run experiments on them; the results show that both the SD-ISMOTE and SD-ISMOTE2 algorithms achieve clear improvements.
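The cluster-level allocation step can be sketched as follows. The thesis's "rational sampling quantity calculation method" is not spelled out in the abstract, so the inverse-size weighting here is an illustrative stand-in with the same intent: sparse clusters receive proportionally more synthetic samples so the subdivision's internal distribution evens out.

```python
import numpy as np

def cluster_sampling_counts(cluster_sizes, total_new):
    """Allocate an oversampling quota across the k-means clusters of one
    subdivision.  Smaller clusters get larger shares (inverse-size
    weighting, an assumption); rounding error is given to the largest
    share so the counts sum exactly to total_new."""
    inv = 1.0 / np.asarray(cluster_sizes, dtype=float)
    weights = inv / inv.sum()
    counts = np.floor(weights * total_new).astype(int)
    counts[np.argmax(weights)] += total_new - counts.sum()  # fix rounding
    return counts
```

For example, clusters of sizes 10, 5, and 1 sharing a quota of 16 new samples receive 1, 2, and 13 samples respectively, so the sparsest cluster is filled in the most.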
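The enlarged random factor of SD-ISMOTE2 can be illustrated with a SMOTE-style interpolation. With a factor of 1.0 the synthetic sample lies on the segment between a seed and its minority neighbour (the n-dimensional ball of the former method); a factor above 1.0 widens the random range so new samples can land beyond the neighbour, closer to the decision boundary. The factor value is a free parameter here; the thesis's exact setting is not given in the abstract:

```python
import numpy as np

def smote_interpolate(x, neighbor, factor=1.0, rng=None):
    """SMOTE-style synthetic sample: x + gap * (neighbor - x) with
    gap drawn uniformly from [0, factor].  factor > 1 enlarges the
    sampling range beyond the neighbour (the SD-ISMOTE2 idea)."""
    rng = np.random.default_rng() if rng is None else rng
    gap = rng.uniform(0.0, factor)
    return x + gap * (neighbor - x)
```

Applying this only to AL_SAFE seeds, as the abstract describes, pushes the synthetic samples toward the boundary region where misclassification risk is highest.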
Keywords/Search Tags: imbalanced data set, decision boundary, classification, subdivision of minority samples, roulette wheel selection