Using the C-index to measure prediction accuracy and variable importance of random forests, with application to tissue microarray data

Posted on: 2005-06-22
Degree: Ph.D
Type: Thesis
University: University of California, Los Angeles
Candidate: Huang, Yunda
Full Text: PDF
GTID: 2458390008996862
Subject: Biology
Abstract/Summary:
The tissue microarray (TMA) is a state-of-the-art technique for high-throughput molecular analysis of a large number of tumor samples in a single staining reaction. TMAs allow one to evaluate highly specialized tumor marker expression patterns, which may lead to improved diagnostic, prognostic, and therapeutic applications in the clinic.

Typically, TMA data have relatively few observations yet many highly skewed and correlated covariates with weak marginal effects. Random forest (RF) predictors [Bre01] are known to produce improved accuracy with such data. In this thesis, we propose the C-index as an alternative to the error rate for measuring the prediction accuracy of RF predictors. Unlike the error rate, the C-index compares the overall distribution of the posterior predictions and sidesteps the need to specify a cost function and a classification threshold. We prove that, in certain situations, the C-index is far superior to the error rate in determining the prediction accuracy and variable importance of RF predictors. We also introduce a C-margin to measure the prediction strength of individual observations. Based on these C-margins, we propose new measures of variable importance. We show that the C-margin based importance measures are superior to the current E-margin based importance measures and to the Gini index, especially when the class prevalence is unbalanced. We apply our proposed methods to benchmark data from the UCI repository and to our simulated data.

We extend the use of the C-index and C-margins to other important data settings, such as those with continuous outcomes and censored outcomes. We find that the C-margin based variable importance measures often outperform existing measures. Furthermore, we extend the local full likelihood method proposed by LeBlanc and Crowley [LC92] to the construction of residual-based survival random forest predictors. Employing the C-index and the C-margins, we find that our residual-based survival random forest predictor outperforms Breiman's survival random forest predictor (2001), especially in finding important covariates.
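
As an illustration of why a threshold-free measure can be more informative than the error rate under unbalanced class prevalence, here is a minimal sketch for a binary outcome. It uses the standard pairwise definition of the C-index (the fraction of positive/negative pairs whose predicted scores are ordered correctly, with ties counted as one half); it is not taken from the thesis itself. The function names `c_index` and `error_rate`, the 0.5 threshold, and the toy scores are all assumptions made for this example.

```python
from itertools import product


def c_index(scores, labels):
    """Pairwise C-index for a binary outcome (hypothetical helper).

    Counts the fraction of (positive, negative) pairs in which the positive
    observation receives the higher score; tied scores count as 1/2.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one observation of each class")
    concordant = 0.0
    for p, n in product(pos, neg):
        if p > n:
            concordant += 1.0
        elif p == n:
            concordant += 0.5
    return concordant / (len(pos) * len(neg))


def error_rate(scores, labels, threshold=0.5):
    """Misclassification rate after thresholding the scores (threshold assumed)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


# Toy unbalanced data (made up for illustration): 2 positives, 8 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Scores that rank both positives above every negative, but all below 0.5.
informative = [0.40, 0.35, 0.10, 0.12, 0.08, 0.15, 0.05, 0.11, 0.09, 0.07]
# Scores that carry no ranking information at all.
uninformative = [0.10] * 10

for name, scores in [("informative", informative), ("uninformative", uninformative)]:
    print(name, "error rate:", error_rate(scores, labels), "C-index:", c_index(scores, labels))
```

In this toy setting both score vectors yield the same error rate at the assumed threshold, because every observation is classified as negative, while the C-index separates the predictor that ranks the observations perfectly from the one that does not rank them at all.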
Keywords/Search Tags:Random forests, Variable importance, Prediction accuracy, C-index, Data, Measure