Model selection methods for high-dimensional data and their applications to genome-wide association studies

Posted on: 2010-12-06
Degree: Ph.D.
Type: Dissertation
University: Yale University
Candidate: Wu, Zheyang
Full Text: PDF
GTID: 1440390002974547
Subject: Biology
Abstract/Summary:
Genome-wide association studies (GWAS) scan the whole genome with high-density markers to identify genetic variations associated with complex traits. Because GWAS have been highly successful in detecting novel susceptibility genes for many common diseases, the journal Science selected the study of human genetic variation as its Breakthrough of the Year for 2007. However, fewer loci have been found for some diseases than for others, even though the genetic contributions are believed to be equally important. Furthermore, despite the exciting progress, the genes discovered so far account for only a small proportion of the genetic risk in these diseases. Identifying additional genetic variations involved in these diseases is therefore the focus of the next stage of analysis. To facilitate these efforts, this dissertation considers methods that mine the rich information in GWAS data more effectively and efficiently. First, I derived analytical results for the statistical power of three fundamental methods of model selection (equivalently, SNP detection). These results provide a theoretical basis for evaluating different model selection strategies and reveal the mechanism by which genetic signals are captured by statistical model fitting. I have written an R package to help researchers choose a suitable SNP-detection strategy under given genetic model assumptions. Second, I proposed a joint detection-validation SNP analysis strategy and applied it to studies of rheumatoid arthritis and Crohn's disease. The method not only replicated gene findings reported in the literature but also led to the discovery of novel genes and gene-gene interactions. In these analyses of real GWAS data, I also empirically explored some patterns of genetic signals. Third, in the context of penalized model selection, I studied the penalty term as a function of the L0 norm. I proved a theorem giving the range of the penalty term for which the model selection procedure achieves sharp asymptotic minimax estimation, and I established the relationship between the magnitude of the penalty term and the resulting maximal risk (expressed as a constant times the minimax risk).
Keywords/Search Tags:Model selection, Genetic, GWAS, Penalty term, Methods, Data