
Classification Algorithm And Evaluation On Imbalanced Datasets

Posted on: 2012-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: M F Li
Full Text: PDF
GTID: 2178330332490544
Subject: Computer software and theory
Abstract/Summary:
Classification is one of the most important fields of machine learning; many classical algorithms have been applied to practical problems and achieved good results. Traditional classification algorithms assume that the data distribution is balanced, and their main goal is to improve overall accuracy on the dataset. However, many datasets have highly skewed, imbalanced classes, in which the number of samples belonging to one or more classes is much larger than the number belonging to the others. Such datasets are called class-imbalanced datasets: the classes with more samples are the majority classes, and the others are the minority classes. When traditional classifiers are applied to imbalanced datasets, the minority class accounts for only a small proportion of the data, and because the classifiers focus on overall accuracy, accuracy on the minority class is ordinarily ignored. In extreme cases, a classifier may even misclassify all minority samples in order to achieve high overall accuracy. Yet in many practical problems, the minority-class samples are far more important than the majority-class ones. How to improve classification performance on imbalanced datasets has therefore become an important research direction in machine learning.

Research on imbalanced datasets falls into the following directions. First are data-level methods, which are mostly data preprocessing methods: to reduce the level of imbalance and balance the distribution, they change the distribution of the original dataset; sampling methods and feature extraction methods are commonly used. Next are algorithm-level methods: because changing the data distribution has some negative effects, many studies instead improve traditional classification methods to adapt them to imbalanced datasets without changing the data distribution.
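The failure mode described above can be sketched in a few lines. In this hypothetical example (invented data, not from the dissertation), a classifier that always predicts the majority class scores 99% overall accuracy while recognizing none of the minority samples:

```python
# Sketch with invented data: why overall accuracy misleads on imbalanced datasets.

def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_recall(y_true, y_pred, cls):
    """Fraction of samples of class `cls` predicted correctly."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == cls]
    return sum(t == p for t, p in pairs) / len(pairs)

# 990 majority-class (0) samples, 10 minority-class (1) samples.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000          # a classifier that ignores the minority class entirely

print(accuracy(y_true, y_pred))         # 0.99 -- looks excellent
print(class_recall(y_true, y_pred, 1))  # 0.0  -- every minority sample is missed
```

This is exactly the extreme case mentioned above: a useless classifier that still reports high overall accuracy.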
These include cost-sensitive methods, threshold methods, and so on. Traditional performance evaluation criteria often ignore accuracy on the minority class, so evaluation criteria for imbalanced datasets have also been a research focus in recent years; G-Mean and F-Measure are the criteria most commonly used for imbalanced datasets.

For the problems of imbalanced datasets, this dissertation carries out research at both the data level and the algorithm level, and also defines a new criterion for evaluating classifier performance. The main contributions of this dissertation are summarized as follows:

(1) Combining the data level and the algorithm level: a new algorithm named BASM, which combines the Bagging and SMOTE algorithms, is proposed. First, SMOTE is used to generate new synthetic minority samples; then, according to the classes and the accuracy, the weights of both the samples and the base classifiers used in Bagging are adjusted. Most algorithms for imbalanced datasets handle only two-class problems, whereas this algorithm applies to both two-class and multi-class imbalanced datasets. Experiments show that it improves classification performance on both the whole dataset and the minority part, for two-class and multi-class imbalanced datasets alike.

(2) At the algorithm level: a new threshold selection criterion is proposed. It can be proved that this criterion lets both the minority and the majority class reach optimal classification accuracy, independent of the class proportions. Using the back-propagation algorithm (BP algorithm) as the base classifier and searching for the optimal threshold with a genetic algorithm (GA) under the new criterion, good experimental results are obtained on five datasets.
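The dissertation's own criterion and its GA search are not reproduced here; as a simplified, hypothetical illustration of threshold selection on imbalanced data, the sketch below grid-searches a decision threshold over classifier scores and picks the one maximizing G-Mean (the geometric mean of minority and majority accuracy), which by construction cannot be gamed by favoring the larger class:

```python
# Illustrative sketch only: the dissertation searches the threshold with a
# genetic algorithm under its own criterion; a plain grid search over G-Mean
# stands in for both here. Scores and labels below are invented.
import math

def best_threshold(scores, labels, grid=None):
    """Pick the score threshold that maximizes G-Mean on (scores, labels)."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_g = 0.5, -1.0
    pos = sum(labels)
    neg = len(labels) - pos
    for t in grid:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        g = math.sqrt((tp / pos) * (tn / neg))   # G-Mean of per-class accuracies
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Hypothetical classifier scores on an imbalanced sample (3 minority, 7 majority):
scores = [0.95, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1, 0.05]
labels = [1,    1,   0,   1,   0,    0,   0,   0,    0,   0]
t, g = best_threshold(scores, labels)
```

On this toy data the search lowers the threshold below the default 0.5 so that the minority sample scored 0.4 is caught, which raises G-Mean even though one majority sample is then misclassified.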
Based on the threshold selection criterion, a new criterion for evaluating classifier performance on imbalanced datasets is put forward. Experiments show that this criterion focuses more on the error rate than other criteria do.

(3) At the data level: SMOTE has the drawback that it can only generate synthetic samples by linear interpolation between two nearby samples, so a new over-sampling algorithm is proposed: random walk over-sampling (RWO-Sampling). This algorithm creates synthetic minority-class samples by randomly walking from the original data. It can be proved that, under certain assumptions, RWO-Sampling generates samples obeying a probability distribution whose mean and variance are similar to those of the original minority data. Experiments show that RWO-Sampling statistically significantly outperforms alternative methods in terms of the evaluation metrics on imbalanced datasets when common baseline classifiers are used, such as C4.5, Naive Bayes (NB), and k-Nearest Neighbor (KNN).
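The random-walk idea can be sketched as follows. This is a hedged reading of the description above, not the dissertation's implementation: each synthetic sample starts from an original minority sample and is perturbed per attribute by Gaussian noise scaled by that attribute's standard deviation over the square root of the class size, so the synthetic data keep roughly the same mean and variance as the original minority data:

```python
# Sketch of random-walk over-sampling (RWO-Sampling), as described above.
# Details (noise scale, per-attribute statistics) are assumptions for
# illustration; the minority data below are invented.
import random
import statistics

def rwo_sampling(minority, n_new, rng=random.Random(0)):
    """Generate n_new synthetic samples by random walks from minority samples."""
    n = len(minority)
    dims = len(minority[0])
    # Per-attribute standard deviation of the minority class.
    stds = [statistics.pstdev(row[j] for row in minority) for j in range(dims)]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # start the walk at an original sample
        synthetic.append([base[j] - rng.gauss(0, 1) * stds[j] / n ** 0.5
                          for j in range(dims)])
    return synthetic

minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_samples = rwo_sampling(minority, n_new=8)
```

Unlike SMOTE's interpolation, the walk can step slightly outside the convex hull of the existing minority samples, which is what lets the synthetic distribution track the original mean and variance rather than shrinking toward the class centroid.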
Keywords/Search Tags: classification, imbalanced datasets, BP algorithm, Bagging algorithm, threshold selection criterion