On the consistency of ensemble classification algorithms

Posted on: 2008-09-25
Degree: Ph.D
Type: Dissertation
University: University of California, Berkeley
Candidate: Traskin, Mikhail Petrovich
Full Text: PDF
GTID: 1448390005457533
Subject: Statistics
Abstract/Summary:
The fields of machine learning and statistics have grown rapidly over the past 25 years. A great deal of theoretical and empirical research has been devoted to classification, and many new algorithms have been proposed. Not all of them survived empirical scrutiny, but some proved highly effective and are widely used in practice even though their theoretical foundations remain unclear.

In this work we study two of the most successful classification and regression algorithms introduced in the past 15 years: AdaBoost and Random Forests. In both cases we are interested in the consistency of these algorithms in the classification setting.

In the case of the AdaBoost algorithm, we investigate the risk, or probability of error, of the classifier produced by the algorithm. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after n^(1−ε) iterations, for sample size n and ε ∈ (0, 1), the sequence of risks of the classifiers it produces approaches the Bayes risk. This answers the long-standing question of whether the AdaBoost algorithm in its original formulation is consistent.

For the Random Forests algorithm the goal was also to establish consistency or show the lack of it, since there seems to be some confusion about the algorithm: Breiman's words [17], "the generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large," are interpreted, at least sometimes, as a statement about the consistency of the algorithm. We consider a classification setting and give a simple proof that in the one-dimensional case Random Forests, as formulated in [17] and implemented in the randomForests R package, is nothing but a 1-nearest-neighbour classifier, and hence is not consistent. The proof also shows that the algorithm can be made consistent by choosing a bootstrap sample size that is sublinear in the training sample size. A simulation study suggests that the same may be true in higher dimensions.
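As an illustration of the stopping rule above, here is a minimal sketch in Python. The use of scikit-learn's AdaBoostClassifier with decision stumps, and the value eps = 0.1, are assumptions made for the example; the dissertation analyses AdaBoost in its original formulation, not this particular implementation.

```python
# Minimal sketch of the early-stopping rule from the abstract: run
# AdaBoost for floor(n^(1 - eps)) rounds, with eps in (0, 1).
# scikit-learn, eps = 0.1, and depth-1 trees (stumps) are illustrative
# assumptions, not the dissertation's setup.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def early_stopped_adaboost(X, y, eps=0.1):
    """Fit AdaBoost with the number of boosting rounds capped at n^(1 - eps)."""
    n = len(y)
    t_max = max(1, int(n ** (1.0 - eps)))  # stop after n^(1 - eps) iterations
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
        n_estimators=t_max,                             # early stopping
    )
    return clf.fit(X, y)
```

With n = 10,000 and eps = 0.1 this caps the algorithm at roughly 3,981 rounds; the point of the result is that any such sublinear iteration schedule makes the risks converge to the Bayes risk.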
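The bootstrap-size fix from the last paragraph can likewise be sketched. scikit-learn's RandomForestClassifier, which differs in details from the randomForests R package named above, exposes the bootstrap sample size through max_samples; the exponent alpha = 0.7 is an arbitrary choice for illustration.

```python
# Minimal sketch of the consistency fix: grow each tree on a bootstrap
# sample of size m = n^alpha with alpha < 1, instead of size n.
# scikit-learn and alpha = 0.7 are assumptions made for the example.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def sublinear_bootstrap_forest(X, y, alpha=0.7, n_trees=500):
    """Random forest whose bootstrap sample size grows sublinearly in n."""
    n = len(y)
    m = max(2, int(n ** alpha))  # sublinear bootstrap sample size
    clf = RandomForestClassifier(
        n_estimators=n_trees,
        bootstrap=True,
        max_samples=m,  # each tree is grown on m << n points
    )
    return clf.fit(X, y)

# For contrast, the negative result says a one-dimensional forest with
# full-size bootstrap samples predicts exactly like this classifier:
one_nn = KNeighborsClassifier(n_neighbors=1)
```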
Keywords/Search Tags: Algorithm, Consistency, Classification, Sample size