
Variable Selection Problems Using Bayesian Method And Graph-constrained Regularization For Analysis Of High-dimensional Genomic Data

Posted on: 2011-09-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Y Wang
Full Text: PDF
GTID: 1100330332481359
Subject: Probability theory and mathematical statistics

Abstract/Summary:
The problem of variable selection has received enduring attention in both the statistical literature and a variety of applications over the past decades. In recent years, however, a frequently encountered challenge is that the number of variables may be much larger than the sample size. For example, in microarray data analysis, tens of thousands of gene expressions serve as explanatory variables, while usually fewer than 100 examples (patients) are available altogether for training and testing. Such high-dimensional data with "large p but small n" bring about the "curse of dimensionality", which makes most traditional statistical methods unsuitable or inefficient for data analysis.

Many successful theories and methods have been proposed for high-dimensional variable selection. In this thesis, we also study this problem in the linear regression setting, with applications mainly to genomic data. We consider the normal linear model

y = x^T β* + ε,

where y is the response variable, x = (x_1, x_2, ..., x_p)^T is the p-dimensional vector of explanatory variables, ε is an error term following a normal distribution with mean zero and fixed standard deviation σ, and β* = (β*_1, β*_2, ..., β*_p)^T is the true vector of regression coefficients. Throughout this thesis we assume that the number of explanatory variables p can be much larger than the sample size n. The n observations of y and x are denoted by Y = (y_1, y_2, ..., y_n)^T and X = (X_1^T, X_2^T, ..., X_n^T)^T, where X_i = (x_i1, x_i2, ..., x_ip) is the ith sample of the explanatory variable vector x.

We mainly study high-dimensional variable selection in the following aspects.

1. On the variable selection consistency of the Bayesian method

Differing from traditional frequentist methods, we consider the Bayesian approach to variable selection (BVS) in Chapter 2. In high-dimensional settings, BVS has achieved empirical success competitive with, and sometimes superior to, frequentist methods in a variety of applications. We aim to study the theoretical reasons why the Bayesian method achieves these empirical successes in variable selection.

The methodology and concepts of BVS can be described as follows. First, an auxiliary indicator vector γ = (γ_1, γ_2, ..., γ_p) is defined to specify a subset model in the regression setting, with γ_j = 1 if the jth variable is included in the model and γ_j = 0 otherwise. Then, given γ, we obtain x_γ and β_γ ∈ R^{|γ|}, where ν_γ denotes the subvector of a vector ν with components {ν_j} for all j with γ_j = 1, and |γ| = Σ_j γ_j is the number of selected variables. Hence, for the linear model above, the parameters are γ and β_γ. Given a proper prior distribution and conditional on the observed data, the posterior distribution of the parameters can be derived, and the models with large posterior probabilities can be used for statistical inference.

We specify the true joint density by letting x have a uniform distribution, and we assume that the true regression coefficient vector β* is sparse. We impose two conditions on the prior distribution π_n of γ and β_γ: one requires π_n to put enough mass on an approximation neighborhood of the true model, and the other requires π_n to put sufficiently little mass on overly complex models. We then show that, under the resulting posterior distribution, the posterior estimate of the regression function is asymptotically consistent for the true regression function μ_0(x) = E_{f_0}(y | x). This indicates that BVS succeeds in identifying promising subset models that yield good regression performance. This feature is very useful in narrowing the scope of variable selection.
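To make the indicator-variable formulation above concrete, here is a minimal sketch of posterior subset-model selection. It assumes, purely for illustration, a conjugate Gaussian prior N(0, τ²I) on β_γ, a known error variance, and a uniform prior over small subset models; the priors and conditions actually studied in Chapter 2 are more general, and the helper names log_marginal and best_subset_posterior are hypothetical.

```python
import numpy as np
from itertools import combinations

def log_marginal(Y, X_gamma, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of Y under a subset model gamma, with
    beta_gamma integrated out of an illustrative N(0, tau2*I) prior
    and N(0, sigma2*I) errors."""
    n = len(Y)
    # Marginal covariance of Y: sigma2*I + tau2 * X_gamma X_gamma^T
    S = sigma2 * np.eye(n) + tau2 * X_gamma @ X_gamma.T
    _, logdet = np.linalg.slogdet(S)
    quad = Y @ np.linalg.solve(S, Y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def best_subset_posterior(Y, X, pmax=2):
    """Enumerate all subset models with 1..pmax variables (uniform model
    prior) and return the model with the largest posterior probability."""
    p = X.shape[1]
    best, best_lp = None, -np.inf
    for k in range(1, pmax + 1):
        for gamma in combinations(range(p), k):
            lp = log_marginal(Y, X[:, list(gamma)])
            if lp > best_lp:
                best, best_lp = gamma, lp
    return best, best_lp

# Toy usage: n = 50 samples, p = 20 variables, two truly nonzero coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=50)
print(best_subset_posterior(Y, X, pmax=2))   # expected support: (0, 3)
```

Exhaustive enumeration is only feasible for very small pmax; the stochastic search discussed in part 3 below addresses larger model spaces.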
As a special case of the sparseness condition, when some regression coefficients are bounded away from zero while the rest are exactly zero, we aim to show that BVS can identify the true model consistently. We integrate β_γ out of the posterior distribution π_n(γ, β_γ | Y, X) to obtain the marginal posterior π_n(γ | Y, X). Let γ̂ be the model with the largest posterior probability π_n(γ | Y, X) within the considered class of subset models, and let β̂ be the posterior estimate of β* based on the selected model γ̂. In Chapter 2, we show that under some conditions the selected model converges to the true model in the L2 sense. This L2 consistency implies that BVS chooses the important variables with high probability and that falsely chosen variables have very small coefficients. Simulation studies and a real data analysis demonstrate that Bayesian variable selection performs at least comparably to the Lasso and the Dantzig selector.

2. Variable selection and estimation based on graph-constrained regularization in high-dimensional genomic data analysis

Graphs and networks are common ways of depicting biological information. In biology, many different biological processes are represented by graphs, such as regulatory networks and metabolic pathways. The linked genes have high pairwise correlations and form groups (molecular modules) that affect clinical phenotypes/outcomes. In Chapter 3, in order to incorporate the information from these graphs into the analysis of the numerical data, we introduce a graph-constrained regularization procedure for fitting linear regression models and for identifying the relevant groups of genomic variables related to complex phenotypes.

Motivated by the graphical structure of genomic data, Li et al. (2008 and 2010) first proposed a graph-constrained regularization procedure (Grace) that utilizes graph information for variable selection in the framework of regression analysis. Grace works by adding a penalty induced by the Laplacian matrix of the graph to the Lasso procedure. Such a procedure not only enjoys sparsity similar to the Lasso but also encourages a grouping effect and global smoothness of the coefficients over the graph.

We define a new graph-constrained estimate (N-Grace) of the regression coefficients as the minimizer of an objective function Q*(β). Our proposed procedure is similar in spirit to the Grace procedure. However, it differs from Grace in that it does not require smoothness of the coefficients of the genes associated with the graph. Our penalty is designed so that the coefficients of the genes grouped in a subgraph are nonzero or zero simultaneously, according to whether the subgraph is relevant or irrelevant to the regression model. Particularly for variable selection, our procedure seems to be more reasonable than Grace. To solve the N-Grace problem, a "one at a time" coordinate-wise descent algorithm is used (a sketch is given below): at each iteration, the objective function Q*(β) is minimized with respect to one coordinate while the other coordinates are held fixed. Finally, we demonstrate the application of the methods to both simulated and real SNP data.
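To illustrate the "one at a time" coordinate-wise descent idea, the sketch below minimizes a Grace-style objective 0.5‖Y − Xβ‖² + λ1‖β‖_1 + λ2 β^T L β, where L is the graph Laplacian. This is only a stand-in for the actual N-Grace objective Q*(β), whose subgraph-based penalty is not reproduced in this abstract, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in the L1 coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def graph_lasso_cd(Y, X, L, lam1, lam2, n_iter=200, tol=1e-6):
    """'One at a time' coordinate-wise descent for the illustrative objective
    0.5*||Y - X b||^2 + lam1*||b||_1 + lam2 * b' L b   (L = graph Laplacian).
    Each step minimizes the objective in one coordinate, holding the rest fixed."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            r_j = Y - X @ beta + X[:, j] * beta[j]        # partial residual excluding x_j
            z = X[:, j] @ r_j - 2 * lam2 * (L[j] @ beta - L[j, j] * beta[j])
            beta[j] = soft_threshold(z, lam1) / (col_norm2[j] + 2 * lam2 * L[j, j])
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

Here λ1 controls sparsity and λ2 the strength of the graph penalty; in practice both tuning parameters would be chosen by cross-validation.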
3. Bayesian variable selection for dependent explanatory variables and stochastic search for the best subset model

In Chapter 4, we consider Bayesian variable selection for dependent explanatory variables. Given the response y and the vector of explanatory variables x = (x_1, x_2, ..., x_p)^T, we assume that there are at most p_max explanatory variables related to the response y. Inspired by the Grace procedure of Li et al. (2008 and 2010), we treat all the explanatory variables in our setting as nodes of a p-dimensional network graph in which dependent variables are linked, and we use the idea of graph-constrained regularization to develop the prior distribution of the regression coefficient vector. The edges in the resulting association network graph connect pairs of dependent variables whose pairwise associations differ from 0. We use the notation and definitions of the weighted graph G = (V, E, W) from Chapter 3. Given a subset model γ, we develop a Bayesian formulation of the graph-constrained estimation in Li et al. (2008 and 2010), in which the conditional prior of the regression coefficient vector β_γ is built from the graph structure.

As there are at most p_max explanatory variables related to the response y, only subset models with at most p_max variables need to be considered, a significant reduction compared with the total 2^p possible models. Denote the space of candidate subset models by R_pmax, the set of models with at most p_max variables. Throughout this chapter, all models γ considered lie within the space R_pmax; hence we only place the prior distribution π_n of γ on R_pmax. There is no substantive need to further penalize model complexity, and we assume throughout that the models in R_pmax are a priori equally likely, i.e., π_n(γ) = 1/|R_pmax|, where |R_pmax| denotes the number of models in R_pmax. Under the above priors and conditional on the observed data set D, we obtain the posterior distribution π_n(γ | D) of the subset model γ, and BVS pursues the model γ with the largest posterior probability within the considered class of subset models.

In the computational stage, instead of using traditional MCMC algorithms, we propose a stochastic search algorithm, M-BMSS, to find the subset model that attains the largest posterior probability.
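The abstract does not describe the moves used by M-BMSS, so the following is only a generic sketch of a stochastic search over subset models with at most p_max variables: propose adding, dropping, or swapping one variable and keep the move when it increases the log posterior score. Any model-scoring function, for example one built from a marginal likelihood like log_marginal in the first sketch, can be plugged in; all names here are illustrative.

```python
import numpy as np

def stochastic_model_search(log_post, p, pmax, n_steps=2000, seed=0):
    """Generic stochastic search over subset models with at most pmax
    variables (a stand-in for M-BMSS, whose moves are not given here).
    log_post maps a frozenset of variable indices to a log posterior score."""
    rng = np.random.default_rng(seed)
    current = frozenset()
    current_lp = log_post(current)
    best, best_lp = current, current_lp
    for _ in range(n_steps):
        proposal = set(current)
        move = rng.choice(["add", "drop", "swap"])
        if move == "add" and len(proposal) < pmax:
            proposal.add(int(rng.integers(p)))
        elif move in ("drop", "swap") and proposal:
            proposal.remove(int(rng.choice(sorted(proposal))))
            if move == "swap":
                proposal.add(int(rng.integers(p)))
        proposal = frozenset(proposal)
        lp = log_post(proposal)
        if lp > current_lp:                   # greedy uphill acceptance
            current, current_lp = proposal, lp
            if lp > best_lp:
                best, best_lp = proposal, lp
    return sorted(best), best_lp
```

A purely greedy acceptance rule like this can stall in local modes; occasional random restarts or a tempered acceptance step are common remedies in stochastic model search.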
Keywords/Search Tags: Variable selection, high-dimensional data, curse of dimensionality, linear models, Bayesian method, density consistency, regression consistency, Gibbs sampling, genomic data, graph-constrained regularization, coordinate-wise descent algorithm