An Automatically Filter Algorithm For Imbalanced Data Sets Classification

Posted on: 2012-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: W Gong
GTID: 2178330335464785
Subject: Computer software and theory
Abstract/Summary:
An imbalanced data set is one in which the numbers of samples from different categories differ greatly. Imbalance has been reported to hinder the classification performance of many machine-learning algorithms, especially on the minority class. It also significantly reduces the amount of useful training data. In the real world, however, extremely imbalanced data sets (3-5% positive samples) are common in many applications, such as multimedia semantic classification, information retrieval and medical prediction. In addition, the minority class is usually the one people care about most: in an information retrieval application, for example, the documents related to a keyword make up only a small part of all documents. Because traditional machine learning algorithms perform poorly when classifying minority samples, the problem of imbalanced data needs to be addressed.

To resolve this problem, in this paper we propose an approach that automatically removes samples that have no effect, or a negative effect, on classification. After these useless samples are removed, the data set is rebalanced and classification performance also improves.

To realize this idea, we propose a novel automatic filter algorithm that extracts filter rules from the training data. Using these rules, most easy-to-classify samples of the dominant class in an imbalanced training set are eliminated automatically. As a result, the ratio of minority samples increases significantly, which makes the data set more suitable for classification algorithms.

In the experiments, we first extract the filter rules and then use them to balance the training data set. Finally, a classifier is trained on the new data set using SVM.
Our experiments show that:
First, the rule-based filter idea is practical and efficient.
Second, the rules extracted by the filter algorithm remove many majority samples and few minority samples, so the data set becomes more balanced than before.
Third, when the imbalanced data set is filtered by our algorithm and then classified with SVM, performance improves and the time cost decreases significantly.
Fourth, classification performance with the automated filtering algorithm is better than with the cost-sensitive method, and the training time is also lower.
Finally, our filtering algorithm can be applied to a real-world application, automatic extraction of news images, and the experimental results show that it performs well.
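The abstract does not say how the filter rules are extracted, so the following Python sketch only illustrates the general idea of rule-based filtering under an assumed rule form. Here a rule is a per-feature interval spanning the minority-class values seen in training; a sample falling outside any interval is treated as easy to classify as majority and is dropped. The function names (`extract_rules`, `apply_filter`, `minority_ratio`) and the toy data are hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple

# A sample is (feature vector, label); label 1 = minority, 0 = majority.
Sample = Tuple[List[float], int]
# Assumed rule form: (feature index, lowest minority value, highest minority value).
Rule = Tuple[int, float, float]


def extract_rules(samples: List[Sample]) -> List[Rule]:
    """Build one interval rule per feature from the minority samples."""
    minority = [x for x, y in samples if y == 1]
    if not minority:
        raise ValueError("training set contains no minority samples")
    rules = []
    for j in range(len(minority[0])):
        values = [x[j] for x in minority]
        rules.append((j, min(values), max(values)))
    return rules


def is_filtered(x: List[float], rules: List[Rule]) -> bool:
    """A sample is filtered if any feature lies outside the minority interval."""
    return any(x[j] < lo or x[j] > hi for j, lo, hi in rules)


def apply_filter(samples: List[Sample], rules: List[Rule]) -> List[Sample]:
    """Keep only the samples that no rule filters out."""
    return [(x, y) for x, y in samples if not is_filtered(x, rules)]


def minority_ratio(samples: List[Sample]) -> float:
    """Fraction of samples that belong to the minority class."""
    return sum(1 for _, y in samples if y == 1) / len(samples)


# Toy training set: 2 minority samples and 6 majority samples.
train: List[Sample] = [
    ([0.90, 0.80], 1),
    ([0.80, 0.90], 1),
    ([0.10, 0.20], 0),
    ([0.20, 0.10], 0),
    ([0.30, 0.30], 0),
    ([0.85, 0.85], 0),  # majority sample that lies among the minority samples
    ([0.10, 0.90], 0),
    ([0.95, 0.10], 0),
]

rules = extract_rules(train)
filtered = apply_filter(train, rules)

print("minority ratio before:", minority_ratio(train))
print("minority ratio after: ", minority_ratio(filtered))
```

On this toy set the rules remove five of the six majority samples and none of the minority samples. A classifier such as an SVM would then be trained on `filtered`; that step is left out here to keep the sketch free of external dependencies. Because these rules are built from the training minority samples, they never remove a training minority sample, whereas the abstract reports that the thesis's rules remove a few.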
Keywords/Search Tags: Machine learning, Classification, Imbalanced data set, SVM