On the consistency of ensemble classification algorithms

Posted on: 2008-09-25
Degree: Ph.D
Type: Dissertation
University: University of California, Berkeley
Candidate: Traskin, Mikhail Petrovich
Full Text: PDF
GTID: 1448390005457533
Subject: Statistics
Abstract/Summary:
The fields of machine learning and statistics have grown rapidly over the past 25 years. A great deal of theoretical and empirical research has been devoted to classification, and many new algorithms have been proposed. Not all of them survived empirical scrutiny, but some proved highly effective and are widely used in practice even though their theoretical foundations remain unclear.

In this work we study two of the most successful classification and regression algorithms introduced in the past 15 years: AdaBoost and Random Forests. In both cases we are interested in the consistency of these algorithms in the classification setting.

In the case of the AdaBoost algorithm, we investigate the risk, or probability of error, of the classifier produced by the algorithm. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after n^(1−ε) iterations, for sample size n and ε ∈ (0, 1), the sequence of risks of the classifiers it produces approaches the Bayes risk. This answers the long-standing question of whether the AdaBoost algorithm in its original formulation is consistent.

For the Random Forests algorithm the goal was also to establish consistency or show the lack of it, since there seems to be some confusion about the algorithm: Breiman's words [17], "the generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large," are interpreted, at least sometimes, as a statement about the consistency of the algorithm. We consider a classification setting and give a simple proof that in the one-dimensional case Random Forests, as formulated in [17] and implemented in the randomForests R package, is nothing but a 1-nearest-neighbour classifier, and hence is not consistent. The proof also shows that the algorithm can be made consistent by choosing a bootstrap sample size that is sublinear in the training sample size. A simulation study suggests that the same may be true in higher dimensions.
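As an illustration of the stopping rule above, here is a minimal sketch in Python. The use of scikit-learn's AdaBoostClassifier with decision stumps, and the value eps = 0.1, are assumptions made for the example; the dissertation analyses AdaBoost in its original formulation, not this particular implementation.

```python
# Minimal sketch of the early-stopping rule from the abstract: run
# AdaBoost for floor(n^(1 - eps)) rounds, with eps in (0, 1).
# scikit-learn, eps = 0.1, and depth-1 trees (stumps) are illustrative
# assumptions, not the dissertation's setup.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def early_stopped_adaboost(X, y, eps=0.1):
    """Fit AdaBoost with the number of boosting rounds capped at n^(1 - eps)."""
    n = len(y)
    t_max = max(1, int(n ** (1.0 - eps)))  # stop after n^(1 - eps) iterations
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
        n_estimators=t_max,                             # early stopping
    )
    return clf.fit(X, y)
```

With n = 10,000 and eps = 0.1 this caps the algorithm at roughly 3,981 rounds; the point of the result is that any such sublinear iteration schedule makes the risks converge to the Bayes risk.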
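The bootstrap-size fix from the last paragraph can likewise be sketched. scikit-learn's RandomForestClassifier, which differs in details from the randomForests R package named above, exposes the bootstrap sample size through max_samples; the exponent alpha = 0.7 is an arbitrary choice for illustration.

```python
# Minimal sketch of the consistency fix: grow each tree on a bootstrap
# sample of size m = n^alpha with alpha < 1, instead of size n.
# scikit-learn and alpha = 0.7 are assumptions made for the example.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def sublinear_bootstrap_forest(X, y, alpha=0.7, n_trees=500):
    """Random forest whose bootstrap sample size grows sublinearly in n."""
    n = len(y)
    m = max(2, int(n ** alpha))  # sublinear bootstrap sample size
    clf = RandomForestClassifier(
        n_estimators=n_trees,
        bootstrap=True,
        max_samples=m,  # each tree is grown on m << n points
    )
    return clf.fit(X, y)

# For contrast, the negative result says a one-dimensional forest with
# full-size bootstrap samples predicts exactly like this classifier:
one_nn = KNeighborsClassifier(n_neighbors=1)
```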
Keywords/Search Tags: Algorithm, Consistency, Classification, Sample size