Evaluation Of Confounder-controlled Random Forest And Its Application In High Dimensional Data Analysis

Posted on: 2019-02-05; Degree: Master; Type: Thesis
Country: China; Candidate: J Y Liang
GTID: 2348330545488048; Subject: Epidemiology and Health Statistics
Abstract/Summary:
With the rise of multi-omics research such as genomics, epigenetics, and transcriptomics, a large amount of high-dimensional data has been generated. Such data are characterized by high dimensionality and sparsity: the number of variables is much larger than the number of samples, and most of the variables are noise. It is therefore very important to choose appropriate analysis strategies or statistical models to distinguish truly related variables from noise variables. A random forest (RF) is composed of many decision trees, each of which is a predictor that produces its own result. The results of all trees are combined into the final decision, so the model achieves better classification and regression performance. RF builds on bagging by selecting both samples and features at random and by validating internally on out-of-bag samples, which increases speed and greatly reduces the risk of overfitting. RF is now widely used, and well regarded, in multi-omics data analysis. However, when confounding factors are present, it is not appropriate to simply enter them into the random forest as covariates.

This study explores how to control for confounding factors when using RF to analyze high-dimensional data. We use four RF-based methods: random forest (RF), ranger (RANdom forest GEneRator), ranger (weighted), and random forest fitted to the residuals of a general linear model (the "residual method" or "residual+RF"). The purpose of the study is to compare these four methods in simulation experiments, to determine whether each of them can control for confounders when confounders exist, and to compare how well they do so. The simulations compare, under different parameter settings, the proportion of times the causal variable (causal) ranks first by variable importance score (VIS) obtained from RF.

The simulations vary three parameters: the odds ratio (OR), the correlation between the causal variable and the confounder (corr1), and the sample size (N). With P and the other parameters held fixed, the proportion of times the causal variable ranks first by VIS rises in all four models as OR or N increases, which means the causal variable is easier to pick out. When OR, P, and N are constant, the stronger the correlation between the causal variable and the confounder, the lower that proportion is in all four models. However the parameters change, the residual method performs best among the four methods, followed by ranger (weighted). There is no large difference between RF and ranger. Compared with RF and ranger, the residual method and ranger (weighted) correct for the confounding factors and are better at picking out the causal variable.

This study also presents case analyses of two genomic data sets on non-small cell lung cancer (NSCLC), one from a genome-wide association study (GWAS) and one from an epigenome-wide association study (EWAS). The analysis of the GWAS data leads to the same conclusion as the simulation experiments. Applying the ranger (weighted) method to the EWAS data, we find that somatic DNA methylation in the KDM gene is significantly associated with survival in early-stage NSCLC patients, which points to potential targets for epigenetic therapy and demonstrates the practicality of the method. Both the simulation experiments and the case studies show that the residual method and the ranger (weighted) method can control for confounders and improve the ability of RF-based models to pick out causal variables.
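Two pieces of the procedure described above can be illustrated without any forest implementation: removing a confounder's linear effect from a variable, and scoring how often the causal variable ranks first by VIS. The Python sketch below is only an illustration under stated assumptions. The abstract does not say which variables are residualized or how the general linear model is specified, so a simple least-squares fit on a single confounder is assumed here. The names `residualize` and `top_rank_proportion` and the toy numbers are hypothetical, and the random forest fit itself is omitted.

```python
from statistics import mean


def residualize(values, confounder):
    """Residuals of a least-squares fit: values ~ intercept + slope * confounder.

    Assumption: one confounder and a simple linear model; the thesis may
    use a richer general linear model.
    """
    mx, my = mean(confounder), mean(values)
    sxx = sum((x - mx) ** 2 for x in confounder)
    if sxx == 0:
        # A constant confounder explains nothing beyond the mean.
        return [y - my for y in values]
    sxy = sum((x - mx) * (y - my) for x, y in zip(confounder, values))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(confounder, values)]


def top_rank_proportion(importance_runs, causal_index):
    """Fraction of runs in which the causal variable has the strictly
    highest importance score (ties are not counted as ranking first)."""
    if not importance_runs:
        raise ValueError("need at least one run")
    hits = 0
    for scores in importance_runs:
        best = max(scores)
        if scores[causal_index] == best and scores.count(best) == 1:
            hits += 1
    return hits / len(importance_runs)


# Hypothetical toy data: a variable that tracks the confounder closely.
confounder = [0.0, 1.0, 2.0, 3.0, 4.0]
feature = [1.0, 3.1, 4.9, 7.2, 8.8]
adjusted = residualize(feature, confounder)  # would be passed on to a forest

# Hypothetical importance scores from three simulated runs, causal at index 0.
runs = [[0.5, 0.2, 0.1], [0.1, 0.4, 0.2], [0.3, 0.3, 0.1]]
proportion = top_rank_proportion(runs, causal_index=0)
```

In this sketch the adjusted values are uncorrelated with the confounder by construction, which is the property the residual method relies on before the forest is fitted.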
Keywords/Search Tags: random forest, confounder, residual+RF, ranger (weighted), non-small cell lung cancer