Statistical methods for analysis of structured genomic data

Posted on: 2012-07-23
Degree: Ph.D
Type: Dissertation
University: University of Pennsylvania
Candidate: Chuai, Shaokun
Full Text: PDF
GTID: 1460390011465140
Subject: Biology
Abstract/Summary:
Motivated in part by the analysis of high dimensional genomic data, high dimensional statistics, and high dimensional regression analysis in particular, has been an active research area in recent decades. Besides high dimensionality, another important feature of genomic data is that they often have structure, such as time course measurements and group or graphical structures. How to incorporate such structural information into the analysis of numerical data raises interesting statistical challenges. This dissertation develops statistical methods for two problems motivated by genomic data analysis. The first problem concerns variable selection for high dimensional varying coefficient models, for which we develop a regularization method for variable selection and estimation. We use basis function expansion to model the time-dependent regression coefficient functions, and a combination of a smoothness penalty and a group-level penalty to achieve both smooth function estimation and coefficient function selection. We apply the method to microarray time course gene expression data in order to identify the transcription factors that regulate expression changes over time. Our results show that the varying coefficient model provides better power for identifying the relevant transcription factors than a simple time-wise analysis. The second problem concerns variable selection for graph-structured group variables, where we assume that the variables are grouped and also have a graphical structure. Examples include genes in a collection of pathways and single nucleotide polymorphisms (SNPs) in genes. We introduce a new penalty that combines the group Lasso with a graph-constrained smoothness penalty within groups, in order to perform group-level variable selection and to impose some smoothness on the regression coefficients with respect to the graph structure. Simulation results show that the new method gives better variable selection and better prediction when such group and graphical structure information exists. We apply this method to two real data sets: an analysis of glioblastoma gene expression data to identify several KEGG pathways that are potentially related to survival time in glioblastoma, and an analysis of SNP data to identify genes that are associated with patient HDL levels.
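The abstract does not give the exact form of the second penalty. As a rough illustration only, the sketch below assumes one common way of combining the two ingredients: a group Lasso term, lam1 * sum_g sqrt(p_g) * ||beta_g||_2, plus a within-group graph Laplacian quadratic form, lam2 * sum_g beta_g' L_g beta_g. The function names, the sqrt(p_g) weighting, the Laplacian form of the smoothness term, and the tuning parameters lam1 and lam2 are all assumptions made for this example, not details taken from the dissertation.

```python
import numpy as np

def path_laplacian(p):
    """Graph Laplacian (degree matrix minus adjacency) of a path graph on p nodes.
    Used here only as a toy within-group graph."""
    A = np.zeros((p, p))
    for i in range(p - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def graph_group_penalty(beta, groups, laplacians, lam1, lam2):
    """Hypothetical penalty: group Lasso term plus within-group graph smoothness.

    beta       : 1-D array of regression coefficients
    groups     : list of index lists, one per group
    laplacians : list of Laplacian matrices, one per group
    lam1, lam2 : assumed tuning parameters for the two terms
    """
    total = 0.0
    for idx, L in zip(groups, laplacians):
        b = beta[idx]
        # Group Lasso term: an unsquared L2 norm, which can zero out a whole group.
        total += lam1 * np.sqrt(len(idx)) * np.linalg.norm(b)
        # Smoothness term: sum of squared differences across the group's graph edges.
        total += lam2 * float(b @ L @ b)
    return total

groups = [[0, 1, 2], [3, 4]]
laplacians = [path_laplacian(3), path_laplacian(2)]

smooth = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # constant within the first group
rough = np.array([1.0, -1.0, 1.0, 0.0, 0.0])   # sign flips along the graph

print(graph_group_penalty(smooth, groups, laplacians, lam1=1.0, lam2=1.0))
print(graph_group_penalty(rough, groups, laplacians, lam1=1.0, lam2=1.0))
```

Under these assumptions the two coefficient vectors have the same group norms, so the group Lasso term treats them identically; only the Laplacian term separates the vector that is smooth over the graph from the one that is not. A group with all-zero coefficients contributes nothing to either term.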
Keywords/Search Tags:Data, High dimensional, Structure, Variable selection, Method, Statistical