
Study Of Classification Algorithm On Unbalanced Data Sets

Posted on: 2013-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Sun
Full Text: PDF
GTID: 2248330371469923
Subject: Computer software and theory

Abstract/Summary:
Classification is an important research direction in the field of machine learning. After years of development, a number of sophisticated algorithms have emerged and succeeded in practice. These traditional classification algorithms aim to maximize overall prediction accuracy and assume that the class distribution is roughly balanced. In many real-world applications, however, we face imbalanced data sets, in which the instances of one class are far fewer than those of the other classes; that is, the class distribution is highly skewed. The minority and majority classes are commonly referred to as the positive and negative classes, respectively. Because traditional algorithms maximize overall accuracy, they show a strong bias toward the majority class. In an extreme case, a classifier that predicts every instance as the majority class still achieves high accuracy (for example, 99% when the minority class makes up only 1% of the instances), yet it cannot recognize a single minority-class instance. In many applications, however, accuracy on the minority class is far more important. Many studies in the field of data mining have therefore addressed this challenging problem.

The proposed approaches mainly focus on three aspects: the data level, the algorithm level, and the evaluation criterion. At the data level, the training set is artificially balanced by modifying the distribution of the data; the two commonly used techniques are known as under-sampling and over-sampling. At the algorithm level, the learning algorithm itself is modified to be more sensitive to the minority class, as in cost-sensitive learning, ensemble learning, and so on. Accuracy measures the proportion of correctly predicted examples and is therefore not an appropriate evaluation criterion for imbalanced data sets.
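To make the evaluation-criterion point concrete, the following minimal sketch computes the F-Measure and G-Mean for the positive (minority) class from confusion-matrix counts; the function name and example counts are illustrative, not from the thesis:

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Minority-class (positive) metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)            # true positive rate (sensitivity)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)       # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5
    return f_measure, g_mean

# Example: 10 minority and 90 majority instances, a classifier that
# finds 8 of the 10 minority instances at the cost of 5 false alarms.
f, g = imbalance_metrics(tp=8, fn=2, fp=5, tn=85)
# f ≈ 0.696, g ≈ 0.869, while plain accuracy (93%) would look deceptively good
```

Unlike accuracy, both measures collapse toward zero when the minority class is ignored, which is why they are preferred for skewed class distributions.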
More reasonable evaluation criteria are needed, such as the F-Measure and G-Mean.

This thesis works at both the data level and the algorithm level and proposes four methods to tackle this demanding problem. The main contributions are summarized as follows:

(1) We use crossover and mutation operators to generate new minority-class samples. The method employs the Euclidean distance between two samples to evaluate the effectiveness of each newly generated minority-class instance. The proposed method is applied to UCI data sets, and experimental results indicate that it effectively improves the classification accuracy of the minority class.

(2) The majority class is clustered into several groups by the K-means algorithm, and a certain number of instances is randomly sampled from each group so that the sampled instances roughly equal the minority class in size. The sampled majority-class instances are then combined with all the minority-class instances to train a base classifier, and the final prediction is produced by combining these classifiers. The instability of K-means yields a different clustering, and hence a different base classifier, on each run.

(3) AdaBoost is first applied to the imbalanced data set to obtain sample weights. Bagging is then used as the ensemble method, but instead of bootstrap sampling the majority class, we select samples with the largest and smallest weights, while ensuring that the number of majority-class samples selected equals the number of minority-class samples. The sampled majority-class instances and all the minority-class instances together form the training set of each component classifier.

(4) AdaBoost is again used to obtain sample weights, and instances with large weights are regarded as borderline data. Only these borderline instances are over-sampled, producing a balanced data set on which a base classifier is trained.
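As a rough illustration of contribution (1), the sketch below generates synthetic minority samples by one-point crossover and Gaussian mutation, accepting only children whose Euclidean distance to some real minority sample is below a threshold. All parameter names, the Gaussian mutation, and the mean-pairwise-distance threshold are assumptions for illustration; the thesis does not specify these details:

```python
import random

def crossover_mutation_oversample(minority, n_new, mutation_rate=0.1,
                                  sigma=0.05, max_dist=None):
    """Generate n_new synthetic minority samples (lists of floats)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    if max_dist is None:
        # Assumed acceptance threshold: mean pairwise distance
        # among the real minority samples.
        pairs = [(a, b) for i, a in enumerate(minority)
                 for b in minority[i + 1:]]
        max_dist = sum(dist(a, b) for a, b in pairs) / len(pairs)

    new = []
    while len(new) < n_new:
        p1, p2 = random.sample(minority, 2)
        cut = random.randrange(1, len(p1))        # one-point crossover
        child = list(p1[:cut]) + list(p2[cut:])
        child = [x + random.gauss(0, sigma)       # Gaussian mutation
                 if random.random() < mutation_rate else x
                 for x in child]
        # Keep only children near an existing minority sample
        # (the Euclidean-distance effectiveness check).
        if min(dist(child, m) for m in minority) <= max_dist:
            new.append(child)
    return new
```

The distance check is what distinguishes this from blind random generation: children that drift too far from the minority region are discarded rather than added as potentially noisy training points.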
Keywords/Search Tags: imbalanced data sets, KNN algorithm, AdaBoost, Bagging, resampling