Sparse estimation and oracle properties of regularized regression with non-polynomial dimensional covariates

Posted on: 2012-02-28 | Degree: Ph.D | Type: Thesis | University: Princeton University | Candidate: Bradic, Jelena | Full Text: PDF | GTID: 2468390011459178 | Subject: Biology

Abstract/Summary:

This thesis gives further insight into regularized variable selection methods when both the true and the intrinsic dimensionality of the parameter space grow with the sample size. It consists mostly of work establishing explicit and tight rates of convergence for different variable selection properties of regularized methods. The settings range from classical linear models to Cox's hazards regression models.

In high-dimensional model selection problems, penalized least-squares approaches have been used extensively. The first section of this thesis addresses both the robustness and the efficiency of penalized model selection methods when the dimensionality of the covariates grows exponentially fast with the sample size. It proposes a data-driven weighted linear combination of convex loss functions, together with a weighted L1-penalty. The method is completely data-adaptive and does not require prior knowledge of the error distribution. The weighted L1-penalty, with weights obtained from a linear approximation of non-convex functions, is used both to ensure the convexity of the penalty term and to ameliorate the bias caused by the L1-penalty. In this high-dimensional setting, we establish a strong oracle property, with exact rates of convergence, for the proposed method, which possesses both model selection consistency and estimation efficiency for the true non-zero coefficients.
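The weighted L1-penalty with linear-approximation weights can be illustrated with a minimal sketch. It assumes, purely for illustration, that the non-convex function is the SCAD penalty with the conventional shape parameter a = 3.7 and that some initial coefficient estimate is available. The names `scad_derivative` and `weighted_l1_penalty` are hypothetical and are not taken from the thesis.

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty evaluated at |beta| (assumed penalty choice)."""
    b = np.abs(np.asarray(beta, dtype=float))
    # Equal to lam for small coefficients, decaying linearly to 0 for large ones.
    return np.where(b <= lam, lam, np.maximum(a * lam - b, 0.0) / (a - 1.0))

def weighted_l1_penalty(beta, weights):
    """Convex weighted L1 penalty: sum_j w_j * |beta_j|."""
    return float(np.sum(weights * np.abs(beta)))

# Hypothetical initial estimate: two small coefficients and one large one.
initial = np.array([0.0, 0.5, 5.0])
weights = scad_derivative(initial, lam=1.0)

# Large initial coefficients receive weight 0, so they are not shrunk,
# which is how the approximation reduces the bias of the plain L1 penalty.
penalty = weighted_l1_penalty(np.array([1.0, -2.0, 3.0]), weights)
```

With these inputs the large coefficient is left unpenalized while the small ones receive the full L1 weight, so the penalty stays convex in the coefficients.
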
As specific examples, we introduce a robust composite L1-L2 method and an optimal composite quantile method, and evaluate their performance on both simulated and real data examples.

High-throughput genetic sequencing arrays, with thousands of measurements per sample and a large amount of related censored clinical data, have increased the need for better measurement-specific model selection. In the second section we establish strong oracle properties of non-concave penalized methods for non-polynomial (NP) dimensional data with censoring in the framework of Cox's proportional hazards model. A class of folded-concave penalties is employed, and both LASSO and SCAD are discussed specifically. We address the question of under which dimensionality and correlation restrictions an oracle estimator can be constructed and attained. It is demonstrated that non-concave penalties significantly weaken the "irrepresentable condition" needed for LASSO model selection consistency. A large deviation result for martingales, of interest in its own right, is developed for characterizing the strong oracle property. Moreover, the non-concave regularized estimator is shown to achieve asymptotically the information bound of the oracle estimator. A coordinate-wise algorithm is developed for finding a grid of solution paths for penalized hazard regression problems, and its performance is evaluated on simulated and gene association study examples.

In the third section we defined a new, sparse, nonparametric Cox's hazard model and constructed a general class of group penalties suitable for structured variable selection and estimation in designs with exponentially growing parameter spaces. A regression spline expansion of the hazard function was used to construct the partial likelihood for right-censored survival data.
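For readers unfamiliar with the partial likelihood for right-censored data, the following is a minimal sketch of the standard Cox negative log partial likelihood. It assumes a linear predictor, no tied event times, and no penalty term, so it does not reproduce the spline-expanded, group-penalized objective of the thesis. The function name and the toy data are hypothetical.

```python
import numpy as np

def cox_neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood for right-censored data (assumes no ties)."""
    eta = X @ beta                      # linear predictor for each subject
    order = np.argsort(-time)           # sort subjects by decreasing observed time
    eta_sorted = eta[order]
    event_sorted = event[order]
    # Running log-sum-exp gives log of the risk-set sum for each subject:
    # the risk set at time t contains every subject with observed time >= t.
    log_risk = np.logaddexp.accumulate(eta_sorted)
    # Only uncensored subjects (event == 1) contribute terms.
    return float(-np.sum(event_sorted * (eta_sorted - log_risk)))

# Toy data: three subjects, one covariate, the second subject is censored.
X = np.array([[1.0], [0.0], [-1.0]])
time = np.array([1.0, 2.0, 3.0])
event = np.array([1.0, 0.0, 1.0])
value_at_zero = cox_neg_log_partial_likelihood(np.zeros(1), X, time, event)
```

At a zero coefficient vector each event contributes the log of its risk-set size, which gives a quick sanity check on the implementation.
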
A family of folded group penalties is introduced to support sparsity, with a family of nonconvex penalty functions acting across groups and a family of bridge penalties acting within groups. We provided finite-sample sparse oracle inequalities for the empirical risk function with p >> n. Exact rates of convergence are established under no correlation restriction in the covariate space. Moreover, we proposed a groupwise path algorithm for a large class of group penalty functions.

The last section is dedicated to future work on methods for supervised, or model-driven, normalization of microarray data. A new procedure is proposed whose theoretical properties are yet to be discussed. Simultaneous estimation of both important and unimportant sources of variation is proposed through a methodology that consists of three iterative steps. Row-wise marginalization is used to separate biological from technical variation, and additivity of the array effects is used to resolve the confounding of the two sources of variation via column-wise marginalization.

Keywords/Search Tags: Regularized, Oracle, Selection, Regression, Estimation, Used, Methods, Sparse