Statistical Methods For Analysis Of High Dimensional Genomic Data

Posted on: 2017-02-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z L Li
Full Text: PDF
GTID: 1310330536459097
Subject: Statistics
Abstract/Summary:
An important goal of human genetic research is to identify the genetic basis of human diseases and traits. Existing gene- or region-based methods test for association between an outcome and the genetic variants in a pre-specified region, e.g., a gene. Because whole-genome association studies contain massive intergenic regions, we propose a quadratic scan statistic based method that detects the existence and locations of signal regions by scanning the genome continuously. The proposed method accounts for the correlation (linkage disequilibrium) among genetic variants, and allows signal regions to contain both causal and neutral variants, as well as causal variants whose effects are in different directions. We study the asymptotic properties of the proposed scan statistic. We derive an asymptotic threshold that controls the family-wise error rate, and show that under regularity conditions the proposed method consistently selects the true signal regions. We perform simulation studies to evaluate the finite-sample performance of the proposed method. The simulation results show that the proposed procedure outperforms existing methods, especially when signal regions have causal variants whose effects are in different directions, are contaminated with neutral variants, or contain correlated variants. We apply the proposed method to a lung cancer genome-wide association study to identify genetic regions associated with lung cancer risk.

Another important topic of genetic research is estimating the effect sizes of selected signals. Penalized likelihood methods provide an attractive approach for performing variable selection and regression coefficient estimation simultaneously. Motivated by this, we propose variable selection and estimation in generalized linear models (GLMs) using the seamless L0 (SELO) penalized likelihood approach. The SELO penalty is a smooth function that very closely resembles the discontinuous L0 penalty. We develop an efficient algorithm to fit the model, and show that the SELO-GLM procedure has the oracle property in the presence of a diverging number of variables. We propose a Bayesian Information Criterion (BIC) to select the tuning parameter. We show that under some regularity conditions, the proposed SELO-GLM/BIC procedure consistently selects the true model. We perform simulation studies to evaluate the finite-sample performance of the proposed methods. The simulation studies show that the proposed SELO-GLM procedure has better finite-sample performance than several existing methods, especially when the number of variables is large and the signals are weak. We apply SELO-GLM to a breast cancer genetic dataset to identify the SNPs associated with breast cancer risk.
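To make the scanning idea concrete, the following is a minimal sketch of a quadratic scan over candidate windows, not the dissertation's procedure. It assumes per-variant score statistics that are independent and standard normal under the null, so the sum of squares over a window of length L has mean L and variance 2L; the dissertation's method additionally accounts for linkage disequilibrium and uses an asymptotic family-wise threshold, both of which are omitted here. The function name `quadratic_scan` and its parameters are hypothetical.

```python
import numpy as np

def quadratic_scan(z, min_len, max_len):
    """Return (standardized statistic, start, end) of the best window.

    Assumption (not from the abstract): entries of z are independent N(0, 1)
    under the null, so a window's sum of squares is chi-square with `length`
    degrees of freedom and is standardized by mean `length`, variance 2*length.
    """
    z = np.asarray(z, dtype=float)
    # Prefix sums of squared scores give every window sum in O(1).
    csum = np.concatenate(([0.0], np.cumsum(z ** 2)))
    best = (-np.inf, 0, 0)
    for length in range(min_len, max_len + 1):
        if length > z.size:
            break
        q = csum[length:] - csum[:-length]          # sum of z[i:i+length]**2
        std_q = (q - length) / np.sqrt(2.0 * length)
        start = int(np.argmax(std_q))
        if std_q[start] > best[0]:
            best = (float(std_q[start]), start, start + length)
    return best
```

Because the statistic squares each score, a window whose variants have effects in opposite directions still accumulates signal, which is the property the abstract emphasizes.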
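The abstract does not state the SELO penalty itself. As an illustration, the sketch below uses the form commonly given for the seamless L0 penalty in the literature, p(b) = (lambda / log 2) * log(|b| / (|b| + tau) + 1); treating this as the form used in the dissertation is an assumption, and the names `selo_penalty`, `lam`, and `tau` are hypothetical.

```python
import numpy as np

def selo_penalty(beta, lam, tau):
    """Elementwise SELO penalty (assumed form, see lead-in).

    The penalty is 0 at beta = 0 and approaches `lam` as |beta| grows,
    so for small `tau` it mimics the L0 penalty lam * 1{beta != 0}.
    """
    b = np.abs(np.asarray(beta, dtype=float))
    return (lam / np.log(2.0)) * np.log(b / (b + tau) + 1.0)
```

With a small `tau`, the values are nearly indistinguishable from an L0 penalty while the function remains continuous, which is what makes it usable inside a penalized likelihood fit.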
Keywords/Search Tags: Scan statistics, Genome-wide association studies, Asymptotics, Generalized linear models, Variable selection