The Performance Of Random Survival Forest Applied In High Dimensional Survival Data

Posted on: 2013-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: G X Chen
GTID: 2234330374992747
Subject: Epidemiology and Health Statistics
Abstract/Summary:
With the continuing development of survival analysis, many researchers have found that even patients with similar clinical factors may have different prognoses. More and more researchers are focusing on genetic factors that are thought to influence prognosis. With the falling cost of and rapid improvements in SNP genotyping technology, the size of genetic study datasets has increased dramatically. The main feature of these datasets is that the number of predictor variables (p) is far larger than the number of individuals observed in the study (p ≫ n). The commonly used approach of analysing each SNP individually with a multiple-comparison correction, such as the Bonferroni correction, is too conservative and not appropriate for such genetic studies. In high-dimensional genetic data, most genetic variants have no biological significance and are referred to as noise SNPs; only a tiny fraction of the variants are associated with patients' survival and are referred to as risk SNPs. Biostatisticians have therefore proposed a two-step analysis strategy: first eliminate the noise SNPs from the dataset to reduce the dimension of the data, and then apply Cox regression analysis to the selected subset for further study.
Machine learning methods such as random forest, support vector machine and artificial neural network are commonly used to reduce the dimension of data. The random forest (RF) algorithm does not require a parametric distribution to be specified, and it gives estimates of which variables are important in the classification. It also generates an internal unbiased estimate of the generalization error as the forest is built. Considerable empirical evidence has shown RF to be highly accurate. Because of this performance, RF is widely used in high-dimensional data analysis.
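To make the conservativeness of the Bonferroni correction mentioned above concrete, the following minimal sketch computes the per-SNP significance threshold for a given number of tests. The function name `bonferroni_threshold`, the family-wise level of 0.05 and the 500,000-SNP panel are illustrative assumptions; only the figure of 399 SNPs comes from the real dataset described in this abstract.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# 399 SNPs is the size of the real dataset in this abstract;
# alpha = 0.05 and the 500,000-SNP panel are illustrative assumptions.
for n_snps in (1, 399, 500_000):
    print(n_snps, bonferroni_threshold(0.05, n_snps))
```

The threshold shrinks in proportion to the number of SNPs tested, so a modest per-SNP effect has to clear an increasingly strict cut-off as p grows. This is the motivation for screening SNPs jointly before fitting a Cox model.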
Random survival forest (RSF) is a natural extension of Breiman's random forest method for the analysis of right-censored survival data. In this paper, we applied RSF to estimate variable importance scores in complex simulated genetic datasets and in a real genotyping dataset. Our objective was to evaluate the performance of RSF in screening for risk SNPs.
Simulation models were set as follows: (1) the proportional hazards model with one risk SNP; (2) the proportional hazards model with two risk SNPs; and (3) the proportional hazards model with five risk SNPs. The linkage disequilibrium (LD) between a true risk SNP and the noise SNPs, and the hazard ratio (HR) of the risk SNPs, were also set at different levels.
Primary results were as follows:
(1) The Cox model with one risk SNP: As the hazard ratio of the risk SNP increased, the importance score of the risk SNP became higher, and the proportion of simulations in which the risk SNP was ranked at the top by average importance score (Mean VIMP) also increased.
(2) The Cox model with two risk SNPs: When the HRs of the risk SNPs were 1.2 and 1.4 respectively, an increase in the maximum r² reduced the importance scores of the risk SNPs, and the proportion of the risk SNPs ranked in the top four by Mean VIMP also decreased. When the HRs of the risk SNPs were both 1.4, RSF performed well: even when the maximum r² was large, the proportion of the two risk SNPs ranked in the top four by Mean VIMP was 89.0%.
(3) The Cox model with five risk SNPs: In scenario 3.1, where the five risk SNPs were in the lowest LD with other SNPs, the proportion of the five risk SNPs ranked in the top seven by Mean VIMP was 92.5%. In scenario 3.2, where the five risk SNPs were in LD with other SNPs at different levels, the proportion ranked in the top seven by Mean VIMP was 80.6%.
(4) The real data analysis: RSF was used to reduce the dimension of a dataset with 399 SNPs from 120 lung cancer patients. Cox proportional hazards models were fitted on the SNP set with high importance scores and a low misclassification rate. Cross-validation was used to evaluate the model's predictive ability. The results showed that twenty-five important SNPs were selected by RSF. After adjustment for clinical covariates (clinical stage, surgery and histopathology), four SNPs were statistically significant in the multivariable Cox model. The cross-validation procedure indicated that the average accuracy of the model was 83.63%.
The main conclusions of this research were as follows: A larger HR of the risk SNPs leads to higher importance scores for the risk SNPs as estimated by RSF. When risk SNPs are in LD with non-risk SNPs, higher LD between a risk SNP and noise SNPs may reduce the variable importance (VIMP) of the risk SNP and weaken the performance of RSF in screening for risk SNPs. To avoid omitting important SNPs, it is necessary to retain somewhat more SNPs from RSF for further analysis. In brief, random survival forest is a promising method for dimensionality reduction of high-dimensional survival data.
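The simulation design described above (a proportional hazards model with a small number of risk SNPs among many noise SNPs) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's actual simulation code: the function name `simulate_ph_snp_data`, the additive 0/1/2 genotype coding, the minor allele frequency, the exponential baseline hazard, the exponential censoring, and the sample and panel sizes are all assumed here. The sketch also generates SNPs independently, so it does not reproduce the LD structure that the thesis varied.

```python
import math
import random


def simulate_ph_snp_data(n, n_snps, risk_hr, maf=0.3, base_rate=0.1,
                         censor_rate=0.05, seed=0):
    """Simulate right-censored survival data under a proportional hazards model.

    risk_hr maps SNP index -> per-allele hazard ratio; all other SNPs are noise.
    Assumptions (not from the thesis): independent SNPs (no LD), additive
    0/1/2 genotype coding, exponential baseline hazard, exponential censoring.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Genotype = number of minor alleles, two independent draws per SNP.
        geno = [(rng.random() < maf) + (rng.random() < maf)
                for _ in range(n_snps)]
        # Linear predictor: sum of log(HR) * genotype over the risk SNPs.
        lin_pred = sum(math.log(hr) * geno[j] for j, hr in risk_hr.items())
        # Exponential event time with hazard base_rate * exp(lin_pred).
        event_time = rng.expovariate(base_rate * math.exp(lin_pred))
        censor_time = rng.expovariate(censor_rate)
        time = min(event_time, censor_time)
        status = int(event_time <= censor_time)  # 1 = event, 0 = censored
        rows.append((time, status, geno))
    return rows


# Two risk SNPs with HRs 1.2 and 1.4, as in the two-risk-SNP scenario;
# n = 200 and 50 SNPs are illustrative sizes only.
data = simulate_ph_snp_data(n=200, n_snps=50, risk_hr={0: 1.2, 1: 1.4})
print(len(data), sum(status for _, status, _ in data))
```

Each simulated dataset of this kind could then be passed to an RSF implementation to obtain a VIMP score per SNP, and the scores averaged over replicates to give a Mean VIMP ranking.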
Keywords/Search Tags: Random survival forest, High dimensional data, Survival analysis, Linkage disequilibrium, Hazard ratio