
Imbalanced Data Learning Based On Kernel Methods

Posted on: 2010-07-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z Y Lin    Full Text: PDF
GTID: 1118360302973767    Subject: Computer application technology
Abstract/Summary:
Imbalanced data learning (IDL), which has attracted intensive attention in recent years, is a special kind of supervised (classification) learning. Its main goal is to handle classification problems whose training examples are unevenly distributed between classes, the so-called class imbalance problems (CIPs). CIPs arise in many important real-world domains, including medical diagnosis and intrusion detection. Most existing classification algorithms are designed under the assumptions of a balanced class distribution and classification-accuracy maximization; when applied to CIPs, they tend to "over-learn" the majority class and degrade the overall performance of the trained classifiers. CIPs thus pose a substantial challenge to the current machine learning research community. Focusing on how to handle CIPs reasonably and effectively with the newly developed kernel methods, especially the support vector machine (SVM), we have carried out a series of related studies, summarized as follows:

(1) A study of a basic issue of IDL: how to evaluate classifier performance reasonably. We first summarize and analyze a set of evaluation metrics frequently used in machine learning, and in particular explain from a theoretical perspective why traditional accuracy is unsuitable for IDL. Then, using a meta-learning method, we experimentally study the performance differences between SVM classifiers optimized under different metrics. The results show that, although SVM is a state-of-the-art method, SVM classifiers optimized under accuracy are still readily biased toward the majority class, whereas optimizing under other, more reasonable metrics yields "bias-rectified" SVM classifiers with better overall performance. These results both clarify the distinctions among evaluation metrics and provide useful guidance for SVM model selection. (An illustrative metric computation appears after part (2) below.)

(2) A study of applying several extended SVMs to CIPs by weighting the training examples asymmetrically. With the least squares SVM and the proximal SVM as representatives, some extended SVMs are used as widely as the standard SVM because they are easy to solve and perform well; applied directly to IDL, however, they usually give unsatisfying results. One of the simplest and most practical remedies is to weight the training examples asymmetrically. This dissertation proposes a new weighting strategy that overcomes deficiencies of some existing weighting methods: it assigns larger weights to minority-class examples than to majority-class examples, while also decreasing the weights of abnormal examples. The strategy can easily be embedded in the extended SVMs. On 15 benchmark datasets, we conducted numerical experiments comparing different combinations of extended SVMs and weighting mechanisms; the results show that the new weighting strategy has significant performance advantages over the other strategies in some cases.
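As a concrete illustration of part (1): the abstract does not name the metrics studied, so the sketch below uses G-mean and the F-measure, two standard IDL metrics (the function and variable names are our own). On a 95:5 class ratio, a degenerate classifier that always predicts the majority class scores 95% accuracy yet a G-mean and F-measure of zero, which is exactly why accuracy misleads on CIPs.

    import numpy as np

    def imbalance_metrics(y_true, y_pred):
        """Accuracy, G-mean and F-measure for labels in {+1 (minority), -1}."""
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == -1))
        tn = np.sum((y_true == -1) & (y_pred == -1))
        fp = np.sum((y_true == -1) & (y_pred == 1))
        acc = (tp + tn) / len(y_true)
        sens = tp / (tp + fn)                      # minority-class recall
        spec = tn / (tn + fp)                      # majority-class recall
        gmean = np.sqrt(sens * spec)
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        return acc, gmean, f1

    # Always predicting the majority class on a 95:5 dataset:
    y_true = np.array([-1] * 95 + [1] * 5)
    y_pred = np.full(100, -1)
    print(imbalance_metrics(y_true, y_pred))  # accuracy 0.95, G-mean 0.0, F-measure 0.0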
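For part (2), here is a minimal sketch of asymmetric example weighting in a weighted least squares SVM. It solves the standard weighted LS-SVM KKT linear system; the inverse-class-frequency weights shown at the end are a common illustrative choice, not the dissertation's specific strategy (which additionally down-weights abnormal examples).

    import numpy as np

    def rbf_kernel(A, B, gamma=0.5):
        # Gaussian kernel matrix between the rows of A and the rows of B.
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def weighted_lssvm_fit(X, y, v, C=1.0, gamma=0.5):
        # Solve the weighted LS-SVM KKT system
        #   [0   y^T        ] [b    ]   [0]
        #   [y   Omega + D  ] [alpha] = [1]
        # with Omega_ij = y_i y_j K(x_i, x_j) and D = diag(1 / (C v_i)).
        n = len(y)
        Omega = np.outer(y, y) * rbf_kernel(X, X, gamma)
        A = np.zeros((n + 1, n + 1))
        A[0, 1:], A[1:, 0] = y, y
        A[1:, 1:] = Omega + np.diag(1.0 / (C * v))
        sol = np.linalg.solve(A, np.r_[0.0, np.ones(n)])
        return sol[1:], sol[0]                     # alpha, b

    def weighted_lssvm_predict(X_tr, y_tr, alpha, b, X_te, gamma=0.5):
        return np.sign(rbf_kernel(X_te, X_tr, gamma) @ (alpha * y_tr) + b)

    # Illustrative asymmetric weights: inverse class frequency, so the
    # minority class gets the larger weight.
    # v = np.where(y == 1, len(y) / (2.0 * (y == 1).sum()),
    #              len(y) / (2.0 * (y == -1).sum()))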
(3) Inspired by the margin-maximization and structural-risk-control training principles of the standard SVM, we propose a new model for training kernel classifiers with a large margin; this is one of the dissertation's significant innovations. The proposed model has an intuitive geometric meaning and, more importantly, emphasizes optimizing the classifier's generalization capacity. Its original optimization problem is non-convex and hard to handle, but after appropriate relaxation it can be transformed into two different, easily solved second-order cone programming (SOCP) formulations. With the help of SeDuMi, a freely available optimization toolbox, we conducted numerical experiments on 12 benchmark datasets. The results demonstrate that, on balanced and imbalanced datasets alike, both new SOCP models significantly outperform the standard SVM in some cases; furthermore, one of the SOCP models is noticeably more robust than the standard SVM. (An SOCP classifier sketch appears after part (4) below.)

(4) Since under-sampling may lose the information carried by the discarded training examples, we propose combining it with ensemble learning to enhance the efficacy of SVM on CIPs, using Bagging and AdaBoost as the ensemble frameworks into which under-sampling is integrated. To overcome deficiencies of some existing ensemble learning algorithms, two new ones are proposed, the "Clustering Based Asymmetric Bagging Ensemble" (CABagE) and the "Modified Asymmetric AdaBoost Ensemble" (MAAdaBE); this is another significant innovation of the dissertation. Comparative experiments on 20 benchmark datasets show that the ensembled SVMs improve prediction of the minority class and usually have better overall performance than a single SVM. Compared with existing ensemble algorithms, both CABagE and MAAdaBE build ensembles with higher minority-class prediction ability; moreover, analyses under different metrics show that MAAdaBE has the best overall performance, which we attribute to the efficient example-weight smoothing mechanism embedded in it.
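To illustrate part (3): the abstract does not spell out the dissertation's SOCP formulations, so the sketch below shows how a large-margin classifier can be cast as an SOCP and handed to a solver in the spirit of SeDuMi. It implements a classic SOCP-based classifier, the minimax probability machine of Lanckriet et al., in CVXPY; this is a related model used for illustration only, not the dissertation's model.

    import numpy as np
    import cvxpy as cp

    def mpm_train(X_pos, X_neg, jitter=1e-6):
        # Minimax probability machine as an SOCP:
        #   minimize ||S_p^T w|| + ||S_n^T w||  s.t.  w . (mu_p - mu_n) = 1,
        # where S_p S_p^T and S_n S_n^T are the per-class covariances.
        d = X_pos.shape[1]
        mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
        S_p = np.linalg.cholesky(np.cov(X_pos, rowvar=False) + jitter * np.eye(d))
        S_n = np.linalg.cholesky(np.cov(X_neg, rowvar=False) + jitter * np.eye(d))
        w = cp.Variable(d)
        prob = cp.Problem(
            cp.Minimize(cp.norm(S_p.T @ w, 2) + cp.norm(S_n.T @ w, 2)),
            [w @ (mu_p - mu_n) == 1])
        prob.solve()                 # any SOCP-capable solver (ECOS, SCS, ...)
        w_v = w.value
        s_p = np.linalg.norm(S_p.T @ w_v)
        s_n = np.linalg.norm(S_n.T @ w_v)
        b = w_v @ mu_p - s_p / (s_p + s_n)
        return w_v, b                # classify positive where w . x >= b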
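For part (4), here is a plain asymmetric-bagging sketch with scikit-learn: each round keeps all minority-class examples, under-samples an equal number of majority-class examples, and trains an SVM; the ensemble predicts by unweighted majority vote. CABagE additionally clusters the majority class to guide the sampling, and MAAdaBE adds weight smoothing inside AdaBoost; both refinements are omitted here.

    import numpy as np
    from sklearn.svm import SVC

    def asymmetric_bagging_svm(X, y, n_rounds=11, seed=0):
        # Each round: all minority examples (+1) plus an equally sized
        # random subset of the majority class (-1) train one RBF SVM.
        rng = np.random.default_rng(seed)
        X_min, X_maj = X[y == 1], X[y == -1]
        models = []
        for _ in range(n_rounds):
            idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
            X_b = np.vstack([X_min, X_maj[idx]])
            y_b = np.r_[np.ones(len(X_min)), -np.ones(len(X_min))]
            models.append(SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_b, y_b))
        return models

    def vote_predict(models, X):
        # Unweighted majority vote; n_rounds is odd, so no ties occur.
        votes = np.sum([m.predict(X) for m in models], axis=0)
        return np.where(votes > 0, 1, -1)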
Keywords/Search Tags: imbalanced data learning, class imbalance problem, kernel method, support vector machine