
Research on Imbalanced Data Learning

Posted on: 2012-05-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Li
Full Text: PDF
GTID: 1118330332499390
Subject: Computer software and theory
Abstract/Summary:
With the progress of science and technology, and particularly the rapid development of information technology, businesses produce data of many kinds in such volume that people face an era of data explosion, yet the useful information hidden in the data remains scarce. Data mining technology came into being in order to extract useful information from huge amounts of data. Data mining is the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable models from large amounts of data. Classification, also known as supervised learning, is a major task in data mining: it builds a classification model on a training data set and then uses the model to classify data whose labels are unknown. Its two main research issues are the design of classification algorithms, and classification model evaluation and model selection.

Imbalanced data is widespread in practice. It differs from balanced data in that the data distribution is skewed and the characteristics of the minority class are not sufficiently expressed, so classifying imbalanced data differs from classifying balanced data. The major strategies for imbalanced data classification include resampling techniques, ensemble learning, and cost-sensitive learning. In addition, performance evaluation methods and metrics for imbalanced data classification algorithms are also major research focuses. Based on an in-depth exploration of these strategies, the research contributions of this paper are as follows.

(1) An imbalanced data learning algorithm, PCBoost, is proposed. The algorithm combines boosting with a sampling technique to remedy the poor classifier performance caused by insufficient minority class information, and it imitates human learning by injecting new learning information at each stage and "correcting" errors promptly after each stage. PCBoost proceeds in four stages. In the first stage, sample weights are initialized. In the second stage, new examples are synthesized from minority class information to balance the training information of the minority class. In the third stage, a weak learning algorithm is called to complete the current stage of learning, and the perturbation data are corrected according to the training results of the sub-classifier. The second and third stages are repeated until the number of iterations T is reached. In the fourth stage, the sub-classifiers from all learning stages are ensembled into the final classifier.

A data synthesis method based on the data distribution is presented to handle data synthesis in PCBoost. The method applies different value-synthesis rules to discrete and continuous attributes, and the generated attribute values are then assembled into a synthetic example. A bound on the training error of PCBoost is proved, and the choice of algorithm parameters is discussed. Using F-measure and G-mean as performance evaluation metrics, PCBoost is compared with a decision tree algorithm, standard AdaBoost, and two algorithms that combine boosting with sampling. (A minimal sketch of the training loop follows.)
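The sketch below illustrates the four-stage loop just described, assuming Python with NumPy and scikit-learn. The synthesis rule (jittered duplicates of minority examples), the correction rule (discarding synthetic points the new sub-classifier misclassifies), and the AdaBoost-style weight update are simplified stand-ins for exposition, not the dissertation's exact procedure.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def pcboost(X, y, T=10, n_synth=50, seed=0):
        """y in {0, 1}, with 1 the minority class; returns a predict function."""
        rng = np.random.default_rng(seed)
        w = np.full(len(y), 1.0 / len(y))             # stage 1: initialize weights
        pool = np.empty((0, X.shape[1]))              # surviving synthetic examples
        learners, alphas = [], []
        for _ in range(T):
            # Stage 2: synthesize minority examples (here, jittered duplicates).
            base = X[y == 1][rng.integers(0, (y == 1).sum(), n_synth)]
            pool = np.vstack([pool, base + rng.normal(0.0, 0.05, base.shape)])
            X_t = np.vstack([X, pool])
            y_t = np.concatenate([y, np.ones(len(pool), dtype=int)])
            w_t = np.concatenate([w, np.full(len(pool), w[y == 1].mean())])
            # Stage 3: train a weak learner on the rebalanced, weighted sample,
            # then "correct" the perturbation data: synthetic points the new
            # sub-classifier rejects are dropped before the next round.
            h = DecisionTreeClassifier(max_depth=2)
            h.fit(X_t, y_t, sample_weight=w_t / w_t.sum())
            pool = pool[h.predict(pool) == 1]
            pred = h.predict(X)
            err = np.clip(w[pred != y].sum(), 1e-10, 1.0 - 1e-10)
            alpha = 0.5 * np.log((1.0 - err) / err)   # AdaBoost-style vote weight
            w *= np.exp(np.where(pred == y, -alpha, alpha))
            w /= w.sum()
            learners.append(h)
            alphas.append(alpha)
        # Stage 4: ensemble the sub-classifiers from all learning stages.
        def predict(Xq):
            votes = sum(a * np.where(h.predict(Xq) == 1, 1.0, -1.0)
                        for h, a in zip(learners, alphas))
            return (votes > 0).astype(int)
        return predict

Rebuilding the synthetic pool each round while keeping only the points the latest sub-classifier accepts mirrors the "learn, then correct in time" idea; the dissertation's distribution-based synthesis additionally treats discrete and continuous attributes separately.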
The experimental results show that PCBoost has clear advantages on imbalanced data classification problems.

(2) When a traditional classifier is trained directly on a whole imbalanced data set, the classification boundary is biased toward the majority class, because the minority class information is inadequately expressed while the majority class information dominates. In fact, not all majority class examples are meaningful for finding the classification boundary; only the examples near the boundary help locate it accurately. Based on this observation, an imbalanced data classification algorithm based on undersampling is proposed. The algorithm undersamples the majority class: it decides whether a majority class example can be discarded according to whether its neighborhood contains minority class examples, and retains only the majority class examples near the classification boundary. To select the most appropriate neighborhood radius, AUC is taken as the optimization objective, and a Bayesian classifier is trained on the undersampled data. Experiments on simulated data and UCI data sets, with AUC as the classifier performance measure, show that the undersampling strategy is effective; a sketch of the procedure appears after this abstract.

(3) When the data distribution is imbalanced, the misclassification cost of the minority class is higher than that of the majority class, so accuracy or error rate is no longer suitable for evaluating imbalanced data classification. Performance evaluation methods and metrics for imbalanced data classifiers should pay more attention to the accuracy on the minority class, i.e., the true positive rate (TPrate). Based on this consideration, a performance evaluation metric for imbalanced data classifiers, the weighted AUC (wAUC), is proposed. The metric gives more weight to the area under the ROC curve where the TPrate is higher, so that wAUC is biased toward classifiers that perform better on the minority class. The properties of the weight function are given, and the basic and statistical properties of wAUC are discussed; when two classifiers have the same AUC, a sufficient condition for deciding which is better is derived. Experiments with a naive Bayesian classifier and a radial basis function neural network on UCI data sets show that wAUC is better suited than OP and AUC to evaluating imbalanced data classifiers; an illustrative implementation also follows the abstract.

Based on the Analytic Hierarchy Process, a theoretical framework for model selection is also presented; the framework uses many kinds of metrics to evaluate classifiers comprehensively.

In conclusion, this paper studies the imbalanced data learning problem from the perspectives of resampling, ensemble learning, and performance evaluation. PCBoost and the undersampling-based algorithm are proposed and shown to be advantageous by experiments on UCI data sets, and the performance evaluation metric wAUC is proposed and shown, by theoretical analysis and experiments, to be superior to AUC and OP. At the same time, some shortcomings remain and several questions deserve further study, for example the relationship between the training error bound of PCBoost and that of AdaBoost, and the random perturbation analysis of PCBoost.
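The following sketch illustrates the undersampling rule of contribution (2), again assuming Python with NumPy and scikit-learn: a majority example survives only if at least one minority example falls inside its neighborhood of radius r, and r is chosen by maximizing the AUC of a naive Bayes classifier trained on the reduced set. The radius grid and the choice of GaussianNB are assumptions of this sketch, and for brevity AUC is measured on the full original set rather than a held-out one.

    import numpy as np
    from sklearn.metrics import pairwise_distances, roc_auc_score
    from sklearn.naive_bayes import GaussianNB

    def undersample(X, y, r):
        """Keep all minority (1) examples and only the majority (0) examples
        lying within distance r of at least one minority example."""
        d = pairwise_distances(X[y == 0], X[y == 1])
        keep = (d <= r).any(axis=1)                  # near the class boundary
        X_res = np.vstack([X[y == 1], X[y == 0][keep]])
        y_res = np.concatenate([np.ones((y == 1).sum(), dtype=int),
                                np.zeros(keep.sum(), dtype=int)])
        return X_res, y_res

    def choose_radius(X, y, radii=np.linspace(0.1, 2.0, 20)):
        """Pick the radius whose undersampled set yields the best AUC."""
        best_r, best_auc = None, -1.0
        for r in radii:
            X_res, y_res = undersample(X, y, r)
            if len(np.unique(y_res)) < 2:            # degenerate sample, skip
                continue
            clf = GaussianNB().fit(X_res, y_res)
            auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
            if auc > best_auc:
                best_r, best_auc = r, auc
        return best_r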
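Finally, one plausible instantiation of the weighted AUC of contribution (3): each strip of area under the empirical ROC curve is weighted by an increasing function of its TPrate, which favors classifiers that are strong on the minority class. The linear weight w(t) = 2t, which integrates to 1 on [0, 1] so that scores stay comparable to AUC, is an assumption of this sketch; the abstract does not specify the dissertation's weight function.

    import numpy as np
    from sklearn.metrics import roc_curve

    def wauc(y_true, scores, weight=lambda t: 2.0 * t):
        """Weighted area under the ROC curve; weight(t) == 1 recovers AUC."""
        fpr, tpr, _ = roc_curve(y_true, scores)
        mid_tpr = 0.5 * (tpr[1:] + tpr[:-1])         # strip height (trapezoid)
        strips = mid_tpr * np.diff(fpr)              # unweighted strip areas
        return float(np.sum(weight(mid_tpr) * strips))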
Keywords/Search Tags: Data Mining, Machine Learning, Classification Algorithm, Imbalanced Data Learning, Boosting Algorithm, Resampling, Performance Evaluation