
Research On Imbalanced Dataset Classification

Posted on: 2012-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: X N Fan
Full Text: PDF
GTID: 2178330338992041
Subject: Computer application technology
Abstract/Summary:
In data mining and machine learning, most classification algorithms are built on the assumption that the dataset is balanced across classes. Imbalanced datasets, however, abound in real life and production; for example, research on imbalanced datasets is valuable for identifying bank fraud. Traditional classification algorithms aim to maximize overall accuracy under the assumption of a uniform class distribution, which does not hold for imbalanced datasets. The methods proposed to address the imbalance problem fall roughly into two groups: data-level approaches and algorithm-level approaches.

Data-level approaches address the imbalance problem by modifying the distribution among classes. Previous research has shown that for some base classifiers, a balanced dataset yields better overall classification performance than an imbalanced one, which justifies the use of data-level approaches in imbalanced learning. This thesis first studies the influence of over-sampling techniques on imbalanced learning. After surveying the most popular over-sampling techniques, we analyze their characteristics using the large margin principle, and on that basis propose a new over-sampling technique, MSYN. To reduce the bias introduced by MSYN's reliance on the single nearest neighbor, we propose a method to approximately compute the hypothesis margin for general classifiers, and extend MSYN accordingly. Experimental studies verify the efficacy of the proposed over-sampling techniques.

For algorithm-level approaches, accuracy is not an appropriate guideline for constructing classifiers. The area under the ROC curve (AUC) is an effective alternative for guiding classifier construction in imbalanced learning.
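The abstract does not spell out MSYN's construction, so as an illustration of the over-sampling family it builds on, here is a minimal SMOTE-style sketch: each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority neighbors. The function name and parameters are our own; MSYN's large-margin selection criterion is not reproduced here.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen minority point and one of its k nearest
    minority-class neighbors (SMOTE-style; illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)              # pick a minority point
        j = neighbors[i, rng.integers(min(k, n - 1))]  # pick a neighbor
        gap = rng.random()               # interpolation coefficient in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the minority region rather than being arbitrary noise.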
The second part of this thesis focuses on MALC, a linear model that aims to maximize AUC. After studying its construction procedure, we propose two modifications to enhance it. Empirical studies on a broad range of real-world datasets show that the proposed modifications yield significant improvements.
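For reference, the empirical AUC that such a model maximizes is the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. A minimal sketch (the function name is ours, not part of MALC):

```python
import numpy as np

def auc_from_scores(scores_pos, scores_neg):
    """Empirical AUC: the fraction of (positive, negative) pairs
    where the positive example is scored higher; ties count 0.5."""
    s_pos = np.asarray(scores_pos, dtype=float)[:, None]
    s_neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (s_pos > s_neg).sum() + 0.5 * (s_pos == s_neg).sum()
    return wins / (s_pos.size * s_neg.size)
```

Unlike accuracy, this quantity is insensitive to the class ratio, which is why it is the preferred guideline for imbalanced learning.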
Keywords/Search Tags:imbalanced datasets, over-sampling, large margin principle, AUC