New methods for variable selection with applications to survival analysis and statistical redundancy analysis using gene expression data

Posted on: 2008-09-07
Degree: Ph.D
Type: Thesis
University: Case Western Reserve University
Candidate: Hu, Simin
Full Text: PDF
GTID: 2440390005965103
Subject: Biology
Abstract/Summary:
An important application of microarray research is the development of cancer diagnostic and prognostic tools based on tumor genetic profiles. For ease of interpretation, such studies aim to identify a small fraction of genes, out of thousands or more, from which to build molecular predictors of clinical outcomes. They therefore require methodologies that can model high-dimensional covariates and perform variable selection simultaneously.

One area of interest is modeling cancer patients' survival time, or time to cancer recurrence, with gene expression data. In the first part of this dissertation, we propose a new penalized weighted least squares method for model estimation and variable selection in accelerated failure time models. In this method, right-censored observations are used as censoring constraints in optimizing the weighted least squares objective function. We also include a ridge penalty to deal with the singularity caused by collinearity and high dimensionality, and use the least absolute shrinkage and selection operator (lasso) to achieve model parsimony. Simulation studies demonstrate that adding censoring constraints improves model estimation and variable selection, especially for data with high-dimensional covariates. Real data examples show that our method is able to identify genes that are relevant to patient survival times.

Another area of interest is cancer subtype classification using gene expression profiles. One important issue is reducing the redundancy caused by correlation among genes. Since genes with correlated expression levels may be co-expressed or belong to the same disease-related biological pathway, including all such genes in a classifier provides very little additional information. In the second part of the dissertation, we define an eigenvalue-ratio statistic to measure a gene's contribution to the joint discriminability of a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis test for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between statistical redundancy testing and the gene selection methods. Real data examples show the effectiveness of our gene selection methods based on the eigenvalue-ratio statistic. We also demonstrate that the selected compact gene subsets not only can be used to build high-quality cancer classifiers but also have biological relevance.
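
The penalized weighted least squares method from the first part can be illustrated with one plausible form of the optimization problem. This is only a sketch consistent with the description above, not the dissertation's exact formulation. All notation is assumed for illustration: y_i is the log of the observed time, delta_i is the event indicator (1 = event observed, 0 = right-censored), x_i is the gene expression vector, w_i are observation weights whose definition the abstract does not give, and lambda_1, lambda_2 are the lasso and ridge tuning parameters.

```latex
% Illustrative sketch only; symbols and the exact constraint form are assumptions.
\begin{aligned}
\min_{\alpha,\,\beta}\quad
  & \sum_{i:\,\delta_i = 1} w_i \left( y_i - \alpha - x_i^{\top}\beta \right)^2
    + \lambda_2 \lVert \beta \rVert_2^2
    + \lambda_1 \lVert \beta \rVert_1 \\
\text{subject to}\quad
  & \alpha + x_j^{\top}\beta \;\ge\; y_j
    \qquad \text{for all } j \text{ with } \delta_j = 0 .
\end{aligned}
```

Under this reading, each right-censored observation contributes an inequality rather than a squared-error term: the unobserved log survival time is known to be at least the censored value y_j, so the fitted value is required to be at least y_j. The ridge term keeps the problem well-posed when the number of genes exceeds the number of patients, and the lasso term sets some coefficients exactly to zero, which is what performs the variable selection.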
Keywords/Search Tags:Gene, Variable selection, Statistical redundancy, Cancer, Methods, Data, Survival