Novel Random Forest and Variable Importance Methods for Clustered Data

Posted on: 2018-06-06
Degree: Ph.D
Type: Dissertation
University: San Diego State University
Candidate: Calhoun, Peter Montgomery
Full Text: PDF
GTID: 1478390020955992
Subject: Computer Science
Abstract/Summary:
Tree-based methods are becoming increasingly popular because they make few statistical assumptions and give accurate predictions. Classification and Regression Trees (CART) can handle a variety of data structures and give easy-to-interpret prediction rules. However, CART has several limitations: it requires independent outcomes, has high variance, gives poor predictive performance, and induces a variable selection bias. In this dissertation, we discuss these limitations and propose algorithms that resolve these issues.

In Chapter 1, we introduce CART and discuss the advantages of tree-based methods. We show that CART handles interactions and nonlinear relationships and provides easy-to-interpret prediction rules. We conclude with an example and discuss some of the limitations of the standard CART implementation.

In Chapter 2, we discuss the MST R package, which extends the CART implementation to handle multivariate survival data. We introduce multivariate survival trees and illustrate how they can be constructed in R. We discuss some of the features of the MST R package. We analyze a dental study to predict tooth loss and estimate survival of molars and non-molars. We conclude with future directions for the MST R package.

In Chapter 3, we introduce random forests. Random forests reduce the variance of CART and are among the most accurate machine learning methods for making predictions and analyzing studies. However, the variable selection bias found in CART still occurs with random forests. We propose a variant of the random forest called completely randomized with acceptance-rejection trees (CRAR). We compare our proposed method with three other methods of constructing random forests: standard random forest (RF), smooth sigmoid surrogate trees (SSS), and extremely randomized trees (ER). We find that CRAR and ER have the best overall accuracy and performance for classification problems. They have the lowest misclassification rates, reduce or eliminate the variable selection bias, and are the fastest algorithms. The best algorithm for regression problems may be selected based on the overall objective: high accuracy, variable selection, or speed. We recommend considering all four algorithms based on the study and objective.

In Chapter 4, we propose the repeated measures random forest (RMRF) algorithm, which extends the standard random forest implementation to handle longitudinal designs. The RMRF algorithm uses subsamples, the robust Wald statistic, and an accept-reject quality control step to grow an ensemble of trees. We adopt an area under the curve (AUC) based permuted importance method to assess variable importance. We show that the RMRF algorithm outperforms other algorithms that naively assume independence under a variety of data simulations. An algorithm that ignores the dependence will favor patient-level variables for strongly correlated responses. We also show that the RMRF algorithm outperforms RF and ER at identifying the informative variable.

The final chapter uses the RMRF algorithm to identify factors associated with nocturnal hypoglycemia. We adopt a permuted importance method to test the significance of factors with random forests. We find that hemoglobin A1c (P=0.01), bedtime blood glucose (P=0.01), insulin on board (P=0.03), time system activated (P=0.02), exercise (P=0.01), and daytime hypoglycemia (P=0.01) are associated with nocturnal hypoglycemia. We show that interaction effects affect hypoglycemia and explore the significance of time system activated. Finally, we assign risk profiles to each night and show that the RMRF algorithm accurately predicts nocturnal hypoglycemia. We conclude that the proposed RMRF algorithm can identify influential variables while handling dependent outcomes.
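As a rough illustration of the general idea behind an AUC-based permuted importance measure, the Python sketch below shuffles one predictor at a time and records the average drop in AUC. It is a generic textbook-style version, not the dissertation's RMRF implementation: the function names, the toy data, and the stand-in scoring function `predict` are all assumptions made for this example, and the sketch treats rows as independent, so it does not account for clustered or repeated measurements.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permuted_auc_importance(predict, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in AUC after shuffling the values of one column across rows."""
    rng = random.Random(seed)
    baseline = auc(labels, [predict(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in place of the original.
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - auc(labels, [predict(r) for r in permuted]))
    return sum(drops) / n_repeats


# Hypothetical toy data: column 0 determines the label, column 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]


def predict(row):
    """Stand-in for a fitted model's risk score; it only looks at column 0."""
    return row[0]


importance_informative = permuted_auc_importance(predict, rows, labels, column=0)
importance_noise = permuted_auc_importance(predict, rows, labels, column=1)
print(importance_informative, importance_noise)
```

In this toy setup the noise column receives an importance of zero because the stand-in scorer never reads it, while shuffling the informative column pulls the AUC down toward 0.5. With a real forest, `predict` would be replaced by the fitted model's predicted probability.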
Keywords/Search Tags: RMRF algorithm, Variable, Random forest, CART, Methods, Nocturnal hypoglycemia, Importance, Trees