
An empirical study of Classification and Regression Tree and Random Forests

Posted on: 2007-09-11
Degree: Ph.D
Type: Thesis
University: State University of New York at Stony Brook
Candidate: Xu, Bin
Full Text: PDF
GTID: 2448390005973364
Subject: Statistics
Abstract/Summary:
Data explosion and exploration fuel the demand for self-learning methodologies that extract hidden patterns from a training data set and provide predictive information for future data sets. This thesis is concerned exclusively with the Classification and Regression Tree (CART) and its successor, Random Forests (RF), especially their classification aspects.

Unlike many traditional classification approaches, such as K-Nearest Neighbors, Discriminant Analysis, and Neural Networks, CART provides information about the data structure and makes its decisions interpretable. With bagging and majority voting, RF combines a large number of random trees to achieve an accurate "black-box" model. Its accuracy is comparable to that of other well-known classification methods such as AdaBoost and Support Vector Machines, while RF is more stable.

After a comprehensive study, extensions of CART and RF are proposed to derive a probability scoring system for improved classification accuracy. A well-known drawback of CART is that it is a hard classifier, which limits its use in practical applications that require a confidence level or a posterior probability estimate. In this thesis, we investigate several scoring methods and contribute s-CART, a new version of CART with a built-in posterior probability scoring system. Analysis of both traditional machine learning benchmark data sets and newly emerged proteomic data sets shows that s-CART has better prediction performance than, and speed competitive with, traditional CART.

Random Forests, an extension of CART in which the final classification decision is made by taking the majority vote of multiple CARTs generated via bootstrap resampling, is hailed as the most promising classifier developed to date. After an intensive literature search and study of its mechanism and properties, we introduce a third source of randomness to RF, a random splitting method at each non-leaf node throughout the tree; the resulting forest is named Forest-RS. The advantage of this randomness is studied theoretically and numerically, and comparisons are made between Forest-RS and the traditional RF. Additionally, we study the stability of RF, especially the diversity of its randomness, which appears to be an untouched research field.

A major contribution of this thesis is a fast, platform-independent software package that implements traditional CART (both classification and regression trees), traditional RF (including Forest-RI and Forest-RC), as well as the newly developed s-CART and Forest-RS. Compared with other known software, public or commercial, it provides much more model information, more flexibility in parameter tuning, and more advanced features such as variable ranking, missing-data handling, and robustness studies. Most of all, it implements the methods developed in this thesis, which no other software does. Key Words: Classification and Regression Tree, Decision Tree, Random Forests, Resampling, Scoring.
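The bagging and majority-vote mechanism summarized above can be illustrated with a short sketch. The Python code below is an illustrative assumption, not the thesis software: it grows trees on bootstrap resamples, takes a majority vote, and averages the trees' leaf-node class proportions as a rough posterior score (in the spirit of the s-CART scoring idea). The function names and parameters are hypothetical, and scikit-learn's splitter="random" option is used only as a loose stand-in for the per-node random splitting of Forest-RS.

```python
# Minimal illustrative sketch (not the thesis software) of bagging + majority vote.
# All names and parameter choices below are assumptions for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, random_splits=False, seed=0):
    """Grow n_trees CART-style trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",                             # random feature subset per node (standard RF)
            splitter="random" if random_splits else "best",  # loose stand-in for per-node random splitting
            random_state=int(rng.integers(1 << 31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority-vote labels plus averaged class probabilities as a crude posterior score."""
    votes = np.stack([t.predict(X) for t in trees])                # (n_trees, n_samples); assumes integer labels
    scores = np.mean([t.predict_proba(X) for t in trees], axis=0)  # averaged leaf-node class proportions
    labels = np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
    return labels, scores
```

Passing random_splits=True mimics, only very loosely, the extra randomness introduced in Forest-RS; the thesis's own construction and its theoretical analysis are not reproduced here.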
Keywords/Search Tags: Classification, Regression Tree, Random Forests, CART, Data, Thesis, Scoring