
Developing Statistical Methods Of Genetic Analysis For Multivariate Data

Posted on: 2011-11-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Shen
Full Text: PDF
GTID: 1100330332978346
Subject: Probability theory and mathematical statistics
Abstract/Summary:
In classical multivariate analysis, the number of observations n always exceeds the number of variables p, and most large-sample results for traditional approaches have been derived with p fixed and n tending to infinity. However, with the recent development of high-throughput biological technologies, collecting large-scale data has become increasingly automated and common, and both the sample size and the dimensionality of such data grow quickly. In many practical problems, the number of variables p is comparable to or greater than the number of observations n. Such large-scale data pose various computational and statistical challenges to classical multivariate methods and call for new statistical methodologies and theories. In this dissertation we focus mainly on several multivariate topics in statistical genetics, including hypothesis testing and variable selection, and propose new statistical methods to deal with the challenges created by high dimensionality. We also investigate the performance of the proposed approaches through both simulation studies and real data analysis. The main contents and results of this study are as follows.

The first chapter gives a brief introduction to two kinds of high-throughput biological data and some related multivariate statistical problems. The main motivation of this study is to provide useful solutions for the analysis of such large-scale data. We then briefly introduce some recent regularization techniques in statistics, since these modern tools are used in our study.

The second chapter explores the theoretical properties of principal components analysis in testing whether multiple correlated predictors (SNPs) in a candidate region influence the trait of interest. First, we provide a test statistic based on principal components regression and derive the exact power function of this statistic.
This result clarifies the relationship between the power of the test and the number of principal components selected, and hence indicates the risk of using an unsupervised rule to select the number of principal components. Second, we introduce a weighted principal components test, which is a general form of many popular test statistics, and compare the performance of these test statistics. Finally, we provide several data-driven adaptive alternatives that bypass the problem of determining the number of principal components.

The third chapter focuses on multiple-trait quantitative trait loci (QTL) mapping. In biomedical research, multiple related complex traits are frequently studied together. Most existing QTL mapping methods are based on univariate analysis and therefore cannot make use of the correlation structure of these traits. In addition, it is not easy for these methods to control the type I error in multiple testing problems. In this chapter we propose a two-stage approach to multiple-trait QTL mapping. In the first stage, we use multivariate regression to construct a Wilks's test statistic, which is used to detect possible QTLs with and/or without epistatic effects. A permutation procedure is applied to control the experiment-wise false positive rate. In the second stage, after the QTL locations have been selected, the various effects are estimated with a univariate mixed linear model approach. Real data analysis and simulation studies demonstrate that our method is applicable and powerful.

The fourth chapter studies the challenging problem of testing for any association between a response variable and a set of predictors when the dimensionality of the predictors is much greater than the number of observations. First, in the context of the linear model, a new approach is proposed for testing against high-dimensional alternatives.
Our method uses soft-thresholding to suppress stochastic noise and therefore improves power, particularly when the alternatives are sparse. Second, we extend the proposed approach to the setting of high-dimensional logistic regression. Finally, we compare the performance of this method with some competing approaches via real data and simulation studies, demonstrating that our method maintains relatively high power against a wide family of alternatives.

In the fifth chapter we consider the problem of comparing mean vectors in the high-dimensional multiple-sample setting, where the dimension of the variables is much larger than the sample size. First, for the one-sample problem, we propose a new approach called the shrinkage-based regularization test to overcome the challenges of high dimensionality. Our approach applies the soft-thresholding technique to reduce random noise and improve testing power. Moreover, an appealing property of this approach is its ability to select the relevant variables that provide evidence against the null hypothesis. Next, we extend this approach to the multiple-sample problem. The proposed test statistics can be viewed as high-dimensional versions of traditional multivariate methods. Lastly, we apply our test statistics to human diabetes data, and the results of the gene-set analysis illustrate that our statistics are feasible and powerful.

In summary, we first study the statistical properties of the principal component method for testing any possible association between predictors and a response variable, and point out the risk of using an unsupervised rule to select the number of principal components. Our results provide a comprehensive examination of this method for the benefit of other researchers. Second, we propose a new multiple-trait QTL mapping approach based on the mixed model, and hence extend single-trait QTL mapping methods. This novel method has wide applications in practice.
Real data analysis and simulation studies demonstrate that our method is applicable and powerful. Lastly, we study two kinds of testing problems in high-dimensional data analysis and propose a series of new test statistics. Our results are not only meaningful in theory but also useful in practical applications. Real examples and simulations demonstrate that our methods are both applicable and powerful.
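
The soft-thresholding step described for the fourth and fifth chapters can be illustrated with a minimal sketch. The abstract does not give the dissertation's actual test statistics, so everything below is an assumption made for illustration: the function names `soft_threshold` and `thresholded_statistic`, the toy statistic (a sum of squared shrunken z-scores), and the threshold value are all hypothetical. The sketch only shows how shrinking small coordinates to zero suppresses noise and, as a by-product, marks the variables that carry the evidence.

```python
def soft_threshold(x, lam):
    """Standard soft-thresholding operator: sign(x) * max(|x| - lam, 0)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    magnitude = abs(x) - lam
    if magnitude <= 0:
        return 0.0
    return magnitude if x > 0 else -magnitude


def thresholded_statistic(z_scores, lam):
    """Hypothetical toy statistic (not the dissertation's): sum of squared
    soft-thresholded z-scores, plus the indices that survive thresholding."""
    shrunk = [soft_threshold(z, lam) for z in z_scores]
    selected = [i for i, s in enumerate(shrunk) if s != 0.0]
    return sum(s * s for s in shrunk), selected


# Five per-variable z-scores: three look like noise, two look like signal.
z = [0.3, -0.8, 4.0, 0.1, -3.5]
stat, selected = thresholded_statistic(z, lam=2.0)
print(stat, selected)
```

With a threshold of 2.0 the three small coordinates are set to zero, so only the third and fifth variables contribute to the statistic. In practice the null distribution of such a statistic would still need to be calibrated, for example by permutation; the sketch does not attempt that.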
Keywords/Search Tags: multivariate analysis, hypothesis test, variable selection, regularization, high-dimensional data