
Research On Imbalanced Dataset Classification

Posted on: 2012-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: X N Fan
Full Text: PDF
GTID: 2178330338992041
Subject: Computer application technology
Abstract/Summary:
In data mining and machine learning, most classification algorithms are built on the assumption that the dataset is balanced across classes. Imbalanced datasets, however, abound in real life and production; for example, research on imbalanced datasets is valuable for identifying bank fraud. Traditional classification algorithms aim to maximize overall accuracy under the assumption of a uniform class distribution, which does not hold for imbalanced datasets. The methods proposed to address the imbalance problem fall roughly into two groups: data-level approaches and algorithm-level approaches.

Data-level approaches address the imbalance problem by modifying the distribution among classes. Previous research has shown that for some base classifiers, a balanced dataset yields better overall classification performance than an imbalanced one, which justifies the use of data-level approaches in imbalanced learning. This thesis first studies the influence of over-sampling techniques on imbalanced learning. After surveying the most popular over-sampling techniques, we analyze their characteristics using the large margin principle, and on that basis propose a new over-sampling technique, MSYN. To reduce the bias introduced by MSYN's reliance on the single nearest neighbor, we propose a method to approximately compute the hypothesis margin for general classifiers, and extend MSYN accordingly. Experimental studies verify the efficacy of the proposed over-sampling techniques.

For algorithm-level approaches, accuracy is not an appropriate guideline for constructing classifiers. The area under the ROC curve (AUC) is an effective alternative for guiding classifier construction in imbalanced learning.
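The abstract does not spell out MSYN's construction, so as an illustration of the over-sampling family it builds on, here is a minimal SMOTE-style sketch: each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority neighbors. The function name and parameters are our own; MSYN's large-margin selection criterion is not reproduced here.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen minority point and one of its k nearest
    minority-class neighbors (SMOTE-style; illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)              # pick a minority point
        j = neighbors[i, rng.integers(min(k, n - 1))]  # pick a neighbor
        gap = rng.random()               # interpolation coefficient in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the minority region rather than being arbitrary noise.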
The second part of this thesis focuses on MALC, a linear model that aims to maximize AUC. After studying its construction procedure, we propose two modifications to enhance it. Empirical studies on a broad range of real-world datasets show that the proposed modifications yield significant improvements.
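For reference, the empirical AUC that such a model maximizes is the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. A minimal sketch (the function name is ours, not part of MALC):

```python
import numpy as np

def auc_from_scores(scores_pos, scores_neg):
    """Empirical AUC: the fraction of (positive, negative) pairs
    where the positive example is scored higher; ties count 0.5."""
    s_pos = np.asarray(scores_pos, dtype=float)[:, None]
    s_neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (s_pos > s_neg).sum() + 0.5 * (s_pos == s_neg).sum()
    return wins / (s_pos.size * s_neg.size)
```

Unlike accuracy, this quantity is insensitive to the class ratio, which is why it is the preferred guideline for imbalanced learning.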
Keywords/Search Tags:imbalanced datasets, over-sampling, large margin principle, AUC