
Study Of Classification Algorithm On Unbalanced Data Sets

Posted on: 2013-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Sun
Full Text: PDF
GTID: 2248330371469923
Subject: Computer software and theory

Abstract/Summary:
Classification is an important research direction in the field of machine learning. After years of development, a number of sophisticated algorithms have emerged and succeeded in practice. These traditional classification algorithms aim to maximize overall prediction accuracy and assume that the class distribution is roughly balanced. In many real-world applications, however, we face imbalanced data sets, in which the instances of one class are far fewer than those of the other classes; that is, the class distribution is highly skewed. The minority and majority classes are commonly referred to as the positive and negative classes, respectively. Because traditional algorithms maximize overall accuracy, they show a strong bias toward the majority class. In an extreme case, a classifier that predicts every instance as the majority class still achieves high accuracy (for example, 99% when the minority class makes up only 1% of the instances), yet it cannot recognize a single minority-class instance. In many applications, however, accuracy on the minority class is far more important. Many studies in the field of data mining have therefore addressed this challenging problem.

The proposed approaches mainly focus on three aspects: the data level, the algorithm level, and the evaluation criterion. At the data level, the training set is artificially balanced by modifying the distribution of the data; the two commonly used techniques are known as under-sampling and over-sampling. At the algorithm level, the learning algorithm itself is modified to be more sensitive to the minority class, as in cost-sensitive learning, ensemble learning, and so on. Accuracy measures the proportion of correctly predicted examples and is therefore not an appropriate evaluation criterion for imbalanced data sets.
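To make the evaluation-criterion point concrete, the following minimal sketch computes the F-Measure and G-Mean for the positive (minority) class from confusion-matrix counts; the function name and example counts are illustrative, not from the thesis:

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Minority-class (positive) metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)            # true positive rate (sensitivity)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)       # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5
    return f_measure, g_mean

# Example: 10 minority and 90 majority instances, a classifier that
# finds 8 of the 10 minority instances at the cost of 5 false alarms.
f, g = imbalance_metrics(tp=8, fn=2, fp=5, tn=85)
# f ≈ 0.696, g ≈ 0.869, while plain accuracy (93%) would look deceptively good
```

Unlike accuracy, both measures collapse toward zero when the minority class is ignored, which is why they are preferred for skewed class distributions.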
More reasonable evaluation criteria are needed, such as the F-Measure and G-Mean.

This thesis works at both the data level and the algorithm level and proposes four methods to tackle this demanding problem. The main contributions are summarized as follows:

(1) We use crossover and mutation operators to generate new minority-class samples. The method employs the Euclidean distance between two samples to evaluate the effectiveness of each newly generated minority-class instance. The proposed method is applied to UCI data sets, and experimental results indicate that it effectively improves the classification accuracy of the minority class.

(2) The majority class is clustered into several groups by the K-means algorithm, and a certain number of instances is randomly sampled from each group so that the sampled instances roughly equal the minority class in size. The sampled majority-class instances are then combined with all the minority-class instances to train a base classifier, and the final prediction is produced by combining these classifiers. The instability of K-means yields a different clustering, and hence a different base classifier, on each run.

(3) AdaBoost is first applied to the imbalanced data set to obtain sample weights. Bagging is then used as the ensemble method, but instead of bootstrap sampling the majority class, we select samples with the largest and smallest weights, while ensuring that the number of majority-class samples selected equals the number of minority-class samples. The sampled majority-class instances and all the minority-class instances together form the training set of each component classifier.

(4) AdaBoost is again used to obtain sample weights, and instances with large weights are regarded as borderline data. Only these borderline instances are over-sampled, producing a balanced data set on which a base classifier is trained.
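As a rough illustration of contribution (1), the sketch below generates synthetic minority samples by one-point crossover and Gaussian mutation, accepting only children whose Euclidean distance to some real minority sample is below a threshold. All parameter names, the Gaussian mutation, and the mean-pairwise-distance threshold are assumptions for illustration; the thesis does not specify these details:

```python
import random

def crossover_mutation_oversample(minority, n_new, mutation_rate=0.1,
                                  sigma=0.05, max_dist=None):
    """Generate n_new synthetic minority samples (lists of floats)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    if max_dist is None:
        # Assumed acceptance threshold: mean pairwise distance
        # among the real minority samples.
        pairs = [(a, b) for i, a in enumerate(minority)
                 for b in minority[i + 1:]]
        max_dist = sum(dist(a, b) for a, b in pairs) / len(pairs)

    new = []
    while len(new) < n_new:
        p1, p2 = random.sample(minority, 2)
        cut = random.randrange(1, len(p1))        # one-point crossover
        child = list(p1[:cut]) + list(p2[cut:])
        child = [x + random.gauss(0, sigma)       # Gaussian mutation
                 if random.random() < mutation_rate else x
                 for x in child]
        # Keep only children near an existing minority sample
        # (the Euclidean-distance effectiveness check).
        if min(dist(child, m) for m in minority) <= max_dist:
            new.append(child)
    return new
```

The distance check is what distinguishes this from blind random generation: children that drift too far from the minority region are discarded rather than added as potentially noisy training points.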
Keywords/Search Tags: imbalanced data sets, KNN algorithm, AdaBoost, Bagging, resampling