
Designing Feature Selection And Classification Methods For Imbalanced Learning And Cost-sensitive Learning Problems

Posted on: 2014-02-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Wang
Full Text: PDF
GTID: 1228330398972871
Subject: Computer application technology
Abstract/Summary:
Over the past few decades, data has become increasingly important in almost every aspect of human life, and the demand for analyzing these data has grown rapidly. As a major subject focusing on data analysis, machine learning has developed considerably in this process. In particular, the classification problem, one of the core problems in machine learning, is of great importance. Various theories, algorithms, and software packages have been systematically developed for different kinds of classification problems, and many of these techniques have been applied in the real world with considerable success. However, new problems keep emerging when existing classification techniques are applied to new applications, so new research is needed to address these challenges.

The first topic addressed in this thesis is imbalanced learning, which has been a hot research topic in the machine learning community over the last decade. Generally speaking, imbalanced learning refers to the task of building a classifier from a dataset whose instances are skewed in their distribution over the classes. Such tasks are interesting because many widely used classification algorithms are biased toward classes with many instances and against classes with only a few instances. In many cases, however, this bias is inappropriate. The goal of imbalanced learning is therefore to build a classifier that achieves high recognition rates on classes with few instances while keeping reasonable performance on classes with many instances. Many real-world problems are imbalanced learning problems; for instance, almost every anomaly detection problem is one, since normal instances usually greatly outnumber abnormal instances. Investigating this topic is therefore of enormous practical significance. In most work on binary imbalanced learning, the Area Under the receiver operating characteristic Curve (AUC) is adopted to measure the performance of classification systems, so research on binary imbalanced learning focuses on maximizing the AUC value of the classifier. Similarly, for multi-class imbalanced learning problems, AUC has been extended to MAUC, and the goal of building a classifier in this case is to maximize the MAUC value.

This thesis focuses on feature selection and classification algorithm design for imbalanced learning. In summary, we design a new feature selection algorithm for binary imbalanced learning problems that maximizes the AUC metric, and a new feature selection algorithm for multi-class imbalanced learning problems that maximizes the MAUC metric. Specifically, for binary imbalanced learning, we propose to adopt the Spearman correlation coefficient to measure the redundancy between features, and then combine this redundancy term with an AUC term that measures the relevance between a feature and the class label. The resulting feature selection method improves the AUC value of the classification system significantly compared to other feature selection methods. For the multi-class imbalanced learning problem, we first analyze the drawbacks of traditional feature selection methods as well as the straightforward approach of using MAUC itself to measure the utility of features. We then propose a new feature selection method that decomposes the multi-class problem into a batch of one-versus-one binary subproblems.
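For reference, the two metrics discussed above have standard forms. The following is the textbook empirical AUC, together with the multi-class extension MAUC of Hand and Till (2001); the notation is conventional rather than taken from the thesis itself.

```latex
% Empirical AUC of a scoring function f, with positive set P and negative
% set N; ties are counted with weight one half.
\mathrm{AUC}(f) = \frac{1}{|P|\,|N|} \sum_{x^{+} \in P} \sum_{x^{-} \in N}
  \Big( \mathbf{1}\big[f(x^{+}) > f(x^{-})\big]
      + \tfrac{1}{2}\,\mathbf{1}\big[f(x^{+}) = f(x^{-})\big] \Big)

% Multi-class extension over c classes (Hand and Till, 2001), where
% \hat{A}(i,j) measures how well classes i and j are separated:
\mathrm{MAUC} = \frac{2}{c(c-1)} \sum_{i < j} \hat{A}(i, j)
```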
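The binary feature selection scheme can be illustrated with a minimal sketch. The greedy forward search and the plain relevance-minus-redundancy combination used here are assumptions for illustration only; the exact combination rule in the thesis may differ.

```python
# Sketch of AUC-relevance / Spearman-redundancy feature selection.
# Assumption (not from the thesis): a greedy forward search in which the
# two terms are combined by simple subtraction, as in mRMR-style criteria.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

def select_features(X, y, k):
    """Greedily pick k columns of X maximizing AUC relevance to the binary
    labels y minus the mean Spearman redundancy with already-chosen columns."""
    n_features = X.shape[1]
    # Relevance: AUC of each feature used directly as a score; take the
    # better of the two orientations so inversely related features count too.
    relevance = np.array([
        max(roc_auc_score(y, X[:, j]), roc_auc_score(y, -X[:, j]))
        for j in range(n_features)
    ])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean absolute Spearman correlation with the
            # features selected so far.
            redundancy = np.mean(
                [abs(spearmanr(X[:, j], X[:, s])[0]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```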
Extensive experiments on 8 datasets with 4 different classification methods show that our method significantly improves the MAUC value of the classification system compared to 8 other feature selection methods.

Regarding the classification step, we address the multi-class problem directly and design a classification algorithm that aims to maximize the MAUC value of the classifier. Technically, we first rewrite the formula for calculating MAUC and find that the MAUC value of a given output matrix can be computed on each column independently. Consequently, the task of maximizing the MAUC value can be decomposed into a batch of independent binary subproblems. Further exploration reveals that each subproblem can be formulated as a bipartite ranking problem, which can be solved well by existing algorithms. Experimental results on 16 datasets confirm that the proposed classification method significantly improves the MAUC value compared to 8 other classification algorithms.

Although the MAUC metric has been used in several works in the literature, and we ourselves design a feature selection algorithm and a classification algorithm to maximize it, there is no direct evidence justifying the use of MAUC. Unlike AUC in binary problems, it is not clear how an output matrix should be converted into discrete class labels when a cost matrix is provided. For this reason, we first determine the best conversion method among 5 candidates. Based on this method, we then explore the correlation between MAUC and the total misclassification cost. The results confirm that MAUC is negatively correlated with the total misclassification cost, which provides experimental evidence for using MAUC as a performance metric.

The other topic studied in this thesis is cost-sensitive learning. It is closely related to imbalanced learning, where the misclassification cost of a positive instance is usually higher than that of a negative instance. Unlike imbalanced learning, cost-sensitive learning takes the cost information into consideration explicitly and aims to minimize the total misclassification cost directly. Since almost every practical classification problem is cost-sensitive, research on cost-sensitive learning is of great importance. In a traditional cost-sensitive learning study, one usually assumes that a fixed cost matrix is given along with the training dataset, and a cost-sensitive classifier is then trained according to this cost matrix. In many real-world applications, however, this assumption does not hold: the user may be unable to provide a fixed cost matrix for various reasons, and hence no cost-sensitive classifier can be built. To tackle this difficulty, we propose a new problem that aims to build a classifier that is robust with regard to uncertain cost matrices. In particular, the uncertainty is described by a set of candidate cost matrices, and we build robust classifiers according to the minimax decision criterion.
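Concretely, the column-wise decomposition behind the MAUC-maximizing classifier above follows from rewriting MAUC in its pairwise form (again in conventional notation rather than the thesis's own):

```latex
% A(i|j): the AUC obtained by ranking the instances of classes i and j
% according to the i-th column of the output matrix.
\hat{A}(i, j) = \frac{A(i \mid j) + A(j \mid i)}{2}
\qquad\Longrightarrow\qquad
\mathrm{MAUC} = \frac{1}{c(c-1)} \sum_{i=1}^{c} \sum_{j \neq i} A(i \mid j)
```

Because A(i | j) depends only on the i-th column, the inner sum for each i can be maximized independently, and each term is precisely a bipartite ranking objective on the instances of classes i and j.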
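The minimax decision criterion itself can be illustrated with a minimal sketch: given class-probability estimates and a set of candidate cost matrices, each instance is assigned the class whose worst-case expected cost is smallest. The function and variable names are hypothetical, and this illustrates the general criterion, not the thesis's implementation.

```python
# Minimax decisions over a set of candidate cost matrices.
import numpy as np

def minimax_predict(proba, cost_matrices):
    """proba: (n_samples, n_classes) class-probability estimates.
    cost_matrices: candidate (n_classes, n_classes) arrays, where C[i, j]
    is the cost of predicting class j when the true class is i."""
    # expected[m, s, j] = sum_i proba[s, i] * C_m[i, j]: the expected cost
    # of predicting class j for sample s under cost matrix m.
    expected = np.stack([proba @ C for C in cost_matrices])
    # Worst case over candidate matrices, then the best class per sample.
    return expected.max(axis=0).argmin(axis=1)

# Toy example: two candidate cost matrices for a 3-class problem.
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.5, 0.4]])
C1 = np.array([[0., 1., 5.], [1., 0., 1.], [1., 1., 0.]])
C2 = np.array([[0., 2., 1.], [5., 0., 1.], [1., 2., 0.]])
print(minimax_predict(proba, [C1, C2]))
```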
Keywords/Search Tags: Machine learning, Imbalanced learning, Cost-sensitive learning, Feature selection, Classification algorithm design, Robust classification