Statistical methods for analysis of structured genomic data

Posted on:2012-07-23

Degree:Ph.D

Type:Dissertation

University:University of Pennsylvania

Candidate:Chuai, Shaokun

GTID:1460390011465140

Subject:Biology

Abstract/Summary:

Motivated in part by the analysis of high-dimensional genomic data, high-dimensional statistics, and high-dimensional regression analysis in particular, has been an active research area in recent decades. Besides being high-dimensional, genomic data often have structure, such as time-course measurements or group and graphical structures. Incorporating such structural information into the analysis of numerical data raises interesting statistical challenges. This dissertation develops statistical methods for two problems motivated by genomic data analysis.

The first problem is variable selection for high-dimensional varying-coefficient models, for which we develop a regularization method for variable selection and estimation. We use basis function expansion to model the time-dependent regression coefficient functions, and a combination of a smoothness penalty and a group-level penalty to achieve both smooth function estimation and coefficient function selection. We apply the method to microarray time-course gene expression data in order to identify the transcription factors that regulate expression changes over time. Our results show that the varying-coefficient model has greater power to identify the relevant transcription factors than a simple time-wise analysis.

The second problem is variable selection for graph-structured group variables, where the variables are grouped and also have a graphical structure. Examples include genes in a collection of pathways and single nucleotide polymorphisms (SNPs) in genes. We introduce a new penalty that combines the group Lasso with a graph-constrained smoothness penalty within groups, in order both to perform group-level variable selection and to impose smoothness of the regression coefficients with respect to the graph structure. Simulation results show that the new method gives better variable selection and prediction when such group and graphical structure information exists. We apply this method to two real data sets: a glioblastoma gene expression data set, to identify several KEGG pathways that are potentially related to glioblastoma survival time; and a SNP data set, to identify genes that are associated with patient HDL level.
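To make the second penalty concrete, here is a minimal sketch of how a group Lasso term and a within-group graph smoothness term might be combined. The abstract does not give the exact formula, so the form used here is an assumption: the group term is weighted by the square root of the group size, and the smoothness term is the sum of squared coefficient differences across graph edges, which equals the Laplacian quadratic form for an unweighted graph. The function names and the tuning parameters `lam1` and `lam2` are hypothetical.

```python
import math

def path_graph_edges(p):
    """Edges of a path graph on p nodes, used here only as a toy within-group graph."""
    return [(i, i + 1) for i in range(p - 1)]

def group_graph_penalty(beta, groups, edges, lam1, lam2):
    """Assumed penalty (not taken from the dissertation):
       lam1 * sum_g sqrt(p_g) * ||beta_g||_2
     + lam2 * sum_g sum_{(u,v) in E_g} (beta_u - beta_v)^2
    `groups[g]` lists the coefficient indices in group g; `edges[g]` lists
    edges between positions inside that group."""
    total = 0.0
    for g, idx in enumerate(groups):
        b = [beta[j] for j in idx]
        # Group Lasso term: shrinks the whole group toward zero together.
        total += lam1 * math.sqrt(len(b)) * math.sqrt(sum(x * x for x in b))
        # Graph smoothness term: penalizes differences between linked coefficients.
        total += lam2 * sum((b[u] - b[v]) ** 2 for u, v in edges[g])
    return total

# Toy example: two groups, each with a path graph among its members.
groups = [[0, 1, 2], [3, 4]]
edges = [path_graph_edges(3), path_graph_edges(2)]
smooth = [1.0, 1.0, 1.0, 0.0, 0.0]   # linked coefficients agree
rough = [1.0, -1.0, 1.0, 0.0, 0.0]   # same group norm, but linked coefficients disagree

pen_smooth = group_graph_penalty(smooth, groups, edges, lam1=1.0, lam2=0.5)
pen_rough = group_graph_penalty(rough, groups, edges, lam1=1.0, lam2=0.5)
```

In this toy case both coefficient vectors have the same group norms, so the group Lasso term cannot tell them apart. Only the graph term penalizes the vector whose linked coefficients disagree, which is the kind of within-group smoothness the abstract describes.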

Keywords/Search Tags:

Data, High dimensional, Structure, Variable selection, Method, Statistical

Related items

1	Variable Selection Methods In Statistical Models For Survival Data
2	Research On Copula Modeling Method And Statistical Inference Based On Vine
3	Structure Identification, Variable Selection And Robust Estimation For Some Semiparametric Models With High Dimensional Complicated Data
4	Variable Selection And Feature Screening In High-dimensional Data
5	Variable Selection Problems Using Bayesian Method And Graph-constrained Regularization For Analysis Of High-dimensional Genomic Data
6	The Parameter Estimation And Variable Selection In High Dimensional Collinearity Models
7	Variable Selection Method For High Dimensional Data
8	Robust Estimation And Variable Selection Of Two Kinds Of Semi-parametric Models Under High Dimension Data
9	Variable Selection For High-Dimensional Gene Data
10	Variable Screening For Statistical Models With Ultrahigh Dimensional Data