High-dimension Large-scale Statistical Inference With Applications To Genome Data

Posted on: 2015-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xiao
Full Text: PDF
GTID: 1220330431487618
Subject: Probability theory and mathematical statistics
Abstract/Summary:
This dissertation studies high-dimensional, large-scale statistical inference with applications to genome data. With the advent of the big data era, big data, including genome data, exhibit high dimensionality and complex correlation. However, statistical dependence and high dimensionality (p ≫ n) pose challenges to many traditional statistical methods and theories. Both high dimensionality and dependence make the large-scale multiple testing problem considerably harder. In this dissertation, for the analysis of high-dimensional (p ≫ n) data with complex dependence, we first use prior weight information to develop oracle and asymptotically optimal weighted false discovery rate (WFDR) control procedures under a hidden Markov model (HMM) dependence structure; we then develop an optimal false discovery rate (FDR) control procedure under the extended multi-HMM dependence; next, based on the lasso model, we develop an optimal FDR control procedure under a more general dependence structure. We apply these theories to genome data. In particular, in genome-wide association studies (GWAS), tens of thousands of tests are performed simultaneously to find whether any single-nucleotide polymorphisms (SNPs) are associated with certain traits, and those tests are correlated under a complex, unknown dependence structure in the high-dimensional setting (p ≫ n). We also consider biomedical imaging data, which take the form of high-dimensional arrays, also known as tensors, and have a complex structure. In medical imaging data analysis, a primary goal is to construct a multiple testing procedure to detect associations between the brain and clinical outcomes under these complex structures.
In Chapter 1, we first introduce the data background for single-nucleotide polymorphisms (SNPs) in genome-wide association studies (GWAS) and for biomedical imaging data, then review the important concepts and existing testing procedures that are relevant to our work. We also outline the structure of the thesis and describe its main content.
In Chapter 2, from a Bayesian perspective on hypothesis testing, we first consider dependent observed data that follow a hidden Markov model, which captures the correlations among hypotheses. Then, based on weight information that assesses the importance of each hypothesis, we develop oracle and asymptotically optimal weighted false discovery rate (WFDR) control procedures that aim to minimize the weighted false non-discovery rate (WFNR) subject to a constraint on the WFDR. We then propose a novel adaptive method for obtaining the asymptotically optimal weights for the new procedure when analyzing SNP data. Both the theoretical properties and the numerical performance of the proposed procedures are investigated.
In Chapter 3, we develop a data-driven penalized criterion, combined with a dynamic programming algorithm, to find change points that divide the whole chromosome into more homogeneous regions. Based on these change points, we obtain a multi-HMM dependence structure, or group dependence structure, for the SNP data. Furthermore, we extend the existing methods, namely the local index of significance (LIS) and the pooled local index of significance (PLIS), to analyze dependent tests obtained from multiple chromosomes with different regions for GWAS under the multi-HMM dependence structure. We then apply the proposed method, which can handle group-dependent tests, to a real data set.
In Chapter 4, for the high-dimensional situation (p ≫ n), we derive test statistics based on the lasso model that have a general dependence structure. Then, based on a hidden conditional random mixture model, we develop an optimal FDR control procedure for this multiple testing problem. Numerical studies demonstrate that the new procedure enjoys superior performance. The approach is further illustrated by real data applications in expression quantitative trait loci (eQTL) mapping.
Chapter 5 proposes an alternative method based on a multiple testing procedure. Based on the proposed new test statistics for conditional dependence, we propose a simultaneous testing procedure for conditional dependence in the Gaussian graphical model (GGM).
In Chapter 6, we give conclusions and discuss possible future work in this field.
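The abstract mentions the local index of significance (LIS) as the basis of the HMM-based FDR procedures. As a rough illustration only, and not the dissertation's own implementation, the sketch below shows a step-up rule commonly described for LIS-type statistics: sort the values in increasing order and reject the largest set of hypotheses whose running average stays at or below the target level. The function name `lis_step_up`, the input values, and the level 0.10 are all hypothetical, and the LIS values are assumed to be already computed (for example, from a fitted HMM).

```python
from typing import List, Sequence


def lis_step_up(lis: Sequence[float], alpha: float) -> List[int]:
    """Return the indices of rejected hypotheses, in increasing order.

    Assumption: lis[i] is the posterior probability that hypothesis i is
    null given all observations. Computing these values (e.g. with the
    forward-backward algorithm on a fitted HMM) is outside this sketch.
    """
    # Rank hypotheses from most to least significant (smallest LIS first).
    order = sorted(range(len(lis)), key=lambda i: lis[i])

    running_sum = 0.0
    cutoff = 0  # number of hypotheses to reject
    for k, idx in enumerate(order, start=1):
        running_sum += lis[idx]
        # The running average of the k smallest values estimates the FDR
        # incurred by rejecting those k hypotheses.
        if running_sum / k <= alpha:
            cutoff = k

    return sorted(order[:cutoff])


# Hypothetical LIS values for six tests, with a target level of 0.10.
example_lis = [0.01, 0.90, 0.03, 0.40, 0.02, 0.75]
rejected = lis_step_up(example_lis, alpha=0.10)
print(rejected)  # [0, 2, 4]
```

In this example the three smallest values average 0.02, while adding the fourth raises the average to 0.115, so only three hypotheses are rejected. The weighted and multi-HMM procedures described in the abstract would modify both the statistics and the threshold rule, which this sketch does not attempt to reproduce.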
Keywords/Search Tags: Large-Scale Multiple Testing, High Dimension, Lasso, False Discovery Rate Control, Compound Decision Theory, Hidden Markov Model, Hidden Conditional Random Mixture Model, Matrix Gaussian Graphical Model