
Research on Ensemble Classification of Unbalanced Data

Posted on: 2013-02-06    Degree: Master    Type: Thesis
Country: China    Candidate: C W Wang    Full Text: PDF
GTID: 2248330371969597    Subject: Computer software and theory
Abstract/Summary:
Classification is one of the most important fields of machine learning and pattern recognition. A special case arises when the classes differ greatly in sample size; such a dataset is called an unbalanced dataset. The class containing few samples is called the minority class, and the other classes are called majority classes. Traditional classifiers perform well on balanced datasets, but on unbalanced datasets they tend to label minority samples as majority samples in pursuit of higher overall accuracy. Since the minority class is often the essential part of the dataset, misclassifying a minority sample incurs a higher cost than misclassifying a majority sample, so improving the performance and generalization ability of classifiers on unbalanced datasets is of great value and significance. In a bank credit risk assessment system, for example, normal lending transactions are far more frequent than bad-credit ones, yet the bank staff care most about separating the small percentage of bad transactions from the normal ones. Text detection, product quality testing, and spam filtering are likewise typical applications involving unbalanced datasets. For simplicity, this thesis considers only two-class classification; a multi-class problem can be decomposed into a number of binary problems.

To improve classifier performance on unbalanced datasets, many improved algorithms have been proposed, such as cost-sensitive learning, the SMOTE resampling technique, improved SVMs, one-sided selection, and lazy learning. These improvements fall into two main areas. The first works at the data level: undersampling the majority class and oversampling the minority class changes the data distribution into a roughly balanced dataset, after which a traditional classification algorithm can be applied.
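The data-level direction can be sketched as naive random undersampling and oversampling. This is an illustrative sketch only (SMOTE, by contrast, synthesizes new minority points by interpolating between nearest neighbors), and the function name is hypothetical:

```python
import random
from collections import Counter

def rebalance(X, y, seed=0):
    """Data-level rebalancing sketch: undersample classes above the mean
    class size and oversample (with replacement) classes below it."""
    rng = random.Random(seed)
    by_class = {}
    for x, yi in zip(X, y):
        by_class.setdefault(yi, []).append(x)
    target = len(X) // len(by_class)                 # mean class size
    out_x, out_y = [], []
    for yi, xs in by_class.items():
        if len(xs) > target:                         # majority: undersample
            xs = rng.sample(xs, target)
        elif len(xs) < target:                       # minority: oversample
            xs = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x += xs
        out_y += [yi] * len(xs)
    return out_x, out_y

X = list(range(12))
y = [0] * 10 + [1] * 2                               # 10 majority vs 2 minority
Xb, yb = rebalance(X, y)
print(Counter(yb))                                   # → Counter({0: 6, 1: 6})
```

Undersampling discards potentially useful majority samples, while naive oversampling merely duplicates minority points; these drawbacks are part of why algorithm-level alternatives are considered.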
The second keeps the original distribution of the dataset and works at the algorithm level, adjusting the training-sample weights so that the classifier pays more attention to the minority class. Even so, accuracy on the minority class of unbalanced datasets remains very low. Inspired by Valiant's PAC model, many researchers hope that ensemble methods can boost weak learners on unbalanced datasets into strong learners that effectively improve minority-class performance. In traditional ensemble algorithms, however, the relationship between the infimum of the margin (gamma) and the error rate is an upward-opening quadratic function, so simply reducing the overall error rate on an unbalanced dataset does not improve classification accuracy on the minority class. This thesis reviews the fundamentals and mainstream models of ensemble learning, illustrates a variety of selective ensemble methods, and discusses the advantages and difficulties of current ensemble learning methods. It mainly analyzes the various improved algorithms for unbalanced datasets and, inspired by the "most information strategy", puts forward two improved algorithms, the first of which is verified by experiment.

The main research work of the thesis includes:

1. Reviewing and summarizing the various ways of combining base classifiers, particularly selective ensemble methods. On the basis of the preceding theoretical analysis, the thesis analyzes the statistical significance of two mainstream resampling methods, the jackknife and the bootstrap, and points out that under the conditions of IID (independent and identically distributed) samples and finite moments, a well-converging data model can be obtained through several rounds of resampling. The thesis also analyzes the distribution characteristics of unbalanced datasets, discusses classification evaluation criteria for them, and surveys the improved classification algorithms for unbalanced datasets.
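For reference, the standard AdaBoost loop that the proposed algorithms build on can be sketched as follows. This is the textbook Freund-Schapire procedure, not either of the thesis's variants; the one-dimensional stump pool is a toy example:

```python
import math

def adaboost(X, y, weak_learners, T):
    """Standard AdaBoost: each round reweights the training samples so that
    later weak learners concentrate on previously misclassified points.
    Labels are in {-1, +1}; weak_learners is a pool of h: x -> {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                               # uniform initial weights
    ensemble = []                                   # list of (alpha, h)
    for _ in range(T):
        # weak learner with the smallest weighted error under current weights
        h, err = min(((h, sum(wi for wi, x, yi in zip(w, X, y) if h(x) != yi))
                      for h in weak_learners), key=lambda p: p[1])
        if err >= 0.5:                              # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, h))
        # upweight mistakes, downweight correct predictions, renormalize
        w = [wi * math.exp(-alpha * yi * h(x)) for wi, x, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# decision stumps on 1-D inputs: sign * (x > threshold)
stumps = [lambda x, t=t, s=s: s * (1 if x > t else -1)
          for t in range(10) for s in (-1, 1)]
X = list(range(10))
y = [-1, -1, -1, 1, 1, -1, 1, 1, 1, 1]              # no single stump is perfect
H = adaboost(X, y, stumps, T=10)
acc = sum(H(x) == yi for x, yi in zip(X, y))
print(acc)                        # boosted training accuracy; the best single stump gets 9/10
```

Note the uniform initial weights and the fact that the weight update depends only on overall error: this is exactly what the abstract argues fails on unbalanced data, since minimizing overall error need not help the minority class.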
2. The thesis puts forward a new algorithm for unbalanced datasets, ILAdaboost, built on the ensemble learning framework. In each iteration, the algorithm uses the base classifier learned in that iteration to evaluate the raw dataset and divides the original dataset into four disjoint subsets according to the assessment results; it then forms a balanced dataset for the next iteration's base classifier by resampling from the four subsets. Because minority samples and misclassified majority samples are chosen with higher probability, the synthesized decision boundary of the classifier shifts away from the minority class. Experimental results on 10 UCI datasets and 2 simulated datasets confirm the validity of the algorithm.

3. Under the guidance of the "most information strategy", an improvement is proposed at the algorithm level with respect to the distribution of unbalanced datasets. Majority and minority samples are assigned different initial weights according to their class populations. After each base classifier is trained, it is assessed on the original dataset, and the training-sample weights for the next iteration are updated according to the different predictions. Theoretically, this method can reasonably take care of the minority class without causing too much accuracy loss on the majority class.
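One possible reading of the ILAdaboost resampling step is sketched below. The abstract does not spell out the four subsets, so the partition used here (minority/majority crossed with correctly/incorrectly classified) and the function name are assumptions for illustration, not the thesis's exact procedure:

```python
import random

def balanced_resample(X, y, predict, n_per_class, seed=0):
    """Assumed four-subset resampling in the spirit of ILAdaboost: split the
    data by class and by whether the current base classifier got each sample
    right, then draw a balanced set that favors minority samples and
    misclassified ("informative") majority samples."""
    rng = random.Random(seed)
    subsets = {('min', True): [], ('min', False): [],
               ('maj', True): [], ('maj', False): []}
    for i, (x, yi) in enumerate(zip(X, y)):
        cls = 'min' if yi == 1 else 'maj'            # convention: +1 = minority
        subsets[(cls, predict(x) == yi)].append(i)

    # minority: keep every sample, oversampling up to n_per_class
    minority = subsets[('min', True)] + subsets[('min', False)]
    picked = minority + [rng.choice(minority)
                         for _ in range(max(0, n_per_class - len(minority)))]

    # majority: prefer misclassified samples, fill the remainder from the
    # correctly classified ones
    hard, easy = subsets[('maj', False)], subsets[('maj', True)]
    picked += hard[:n_per_class]
    need = n_per_class - min(len(hard), n_per_class)
    picked += [rng.choice(easy) for _ in range(need)] if easy else []
    return [X[i] for i in picked], [y[i] for i in picked]

# demo: 3 minority (+1) vs 17 majority (-1); a dummy classifier that
# always predicts the majority class
X = list(range(20))
y = [1] * 3 + [-1] * 17
always_majority = lambda x: -1
Xb, yb = balanced_resample(X, y, always_majority, n_per_class=6)
print(yb.count(1), yb.count(-1))                     # → 6 6
```

The balanced set this produces would then train the next iteration's base classifier, pushing the ensemble's decision boundary away from the minority class as described above.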
Keywords/Search Tags: ensemble learning, unbalanced datasets, most information strategy, AdaBoost, resampling