An Automatically Filter Algorithm For Imbalanced Data Sets Classification

Posted on:2012-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:W Gong

Full Text:PDF

GTID:2178330335464785

Subject:Computer software and theory

Abstract/Summary:

An imbalanced data set is one in which the numbers of samples from different categories differ greatly. Imbalance has been reported to hinder the classification performance of many machine-learning algorithms, especially on the minority class. Imbalanced data also significantly reduce the usable training data. In the real world, however, extremely imbalanced data sets (3-5% positive samples) are common in many applications, such as multimedia semantic classification, information retrieval, and medical prediction. In addition, the minority class is usually the one people care about most: in information retrieval, for example, the documents related to a keyword make up only a small part of all documents. Because traditional machine-learning algorithms classify minority samples poorly, the problem of imbalanced data needs to be solved urgently.

To resolve this problem, in this paper we propose an approach that automatically removes samples that have no effect or a negative effect on classification. Once these useless samples are removed, the data set is rebalanced, and classification performance improves as well.

To realize this idea, we propose a novel automatic filter algorithm that extracts filter rules from the training data. Using these rules, most easy-to-classify majority-class samples in the imbalanced training set are eliminated automatically. As a result, the ratio of minority samples increases significantly, making the training set more suitable for classification algorithms.

In the experiments, we first extract the filter rules and then use them to balance the training data set. Finally, an SVM classifier is trained on the new data set.

Our experiments show the following. First, the rule-based filter idea is practical and efficient. Second, the rules extracted by the filter algorithm remove many majority samples and few minority samples, so the data set becomes more balanced than before. Third, when the imbalanced data set is filtered by our algorithm and then classified with SVM, performance improves and the time cost decreases significantly. Fourth, the automated filtering algorithm achieves better classification performance than the cost-sensitive method, and its training time is also lower. Finally, our filtering algorithm can be applied to a real-world application, the automatic extraction of news images, and the experimental results show good performance.
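
The abstract does not say what form the filter rules take or how they are extracted, so the following Python sketch is only an illustration of the general idea, not the thesis's algorithm. It assumes, hypothetically, that each rule is a single-feature threshold chosen so that the samples falling below it are almost exclusively majority-class. The names `extract_threshold_rules`, `apply_rules`, `minority_ratio`, and the parameter `max_minority_loss` are invented for this example.

```python
from typing import List, Tuple

# One sample is (features, label); label 1 = minority class, 0 = majority class.
Sample = Tuple[List[float], int]
# One rule is (feature_index, threshold): discard samples with feature < threshold.
Rule = Tuple[int, float]


def extract_threshold_rules(samples: List[Sample],
                            max_minority_loss: int = 0) -> List[Rule]:
    """Assumed rule form: per-feature lower thresholds learned from training data.

    For each feature, pick the threshold so that at most `max_minority_loss`
    minority samples lie strictly below it, and keep the rule only if it
    removes at least one majority sample.
    """
    rules: List[Rule] = []
    n_features = len(samples[0][0])
    for j in range(n_features):
        minority_vals = sorted(x[j] for x, y in samples if y == 1)
        if len(minority_vals) <= max_minority_loss:
            continue  # too few minority samples to place a threshold
        threshold = minority_vals[max_minority_loss]
        removed_majority = sum(
            1 for x, y in samples if y == 0 and x[j] < threshold
        )
        if removed_majority > 0:
            rules.append((j, threshold))
    return rules


def apply_rules(samples: List[Sample], rules: List[Rule]) -> List[Sample]:
    """Keep only the samples that no rule discards."""
    return [(x, y) for x, y in samples
            if not any(x[j] < t for j, t in rules)]


def minority_ratio(samples: List[Sample]) -> float:
    """Fraction of samples that belong to the minority class."""
    return sum(y for _, y in samples) / len(samples)


# Toy data: 18 majority samples and 2 minority samples (10% minority).
majority = [([float(i), 1.0], 0) for i in range(18)]
minority = [([15.0, 1.0], 1), ([16.0, 1.0], 1)]
data = majority + minority

rules = extract_threshold_rules(data)
filtered = apply_rules(data, rules)

print("rules:", rules)
print("minority ratio before:", minority_ratio(data))
print("minority ratio after:", minority_ratio(filtered))
```

On this toy set the single extracted rule discards the 15 majority samples that lie below every minority sample, leaving 3 majority and 2 minority samples. The filtered set would then be passed to an SVM trainer of the reader's choice; that step is omitted here because the abstract gives no details of the SVM configuration.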

Keywords/Search Tags:

Machine learning, Classification, Imbalanced data set, SVM

Related items

1	Research On Classification Algorithms For Imbalanced Dataset
2	An Automatically Filter Algorithm For Imbalanced Data Sets Classification
3	Research On Imbalanced Data Augmentation And Imbalanced Classification Based On Auto-Encoder
4	Research On Imbalanced Data Classification Algorithm Based On Extreme Learning Machine
5	Research On Classification Methods Based On Extreme Learning Machine
6	Research On Extreme Learning Machine For Online Sequential Imbalanced Data Classification
7	Imbalanced Data Classification Algorithm Based On Unsupervised Intelligent Under Sampling Method
8	Imbalanced Classification Methods Based On Extreme Learning Machine And The Application
9	Research On The Imbalanced Data Learning
10	Research On The Classification Of Imbalanced Data Sets Based On R-SMOTE