Statistical methods for gene set based prediction and clustering analysis of microarrays

Posted on: 2012-06-14 | Degree: Ph.D. | Type: Thesis
University: University of Minnesota | Candidate: Li, Ran | Full Text: PDF
GTID: 2468390011967622 | Subject: Biology
Abstract/Summary:
We study statistical methods for cancer microarray classification that incorporate gene dependence to improve sample prediction and sample clustering. For sample prediction, we propose a composite likelihood based approach to group genes into tightly linked clusters. Compared with the commonly used Euclidean distance based clustering approach, the proposed likelihood based approach naturally incorporates both positive and negative dependence among genes. We compute a summary score for each gene cluster to explicitly capture the within-cluster gene dependence, and apply an L1 penalized logistic regression model for simultaneous sample prediction and selection of important gene clusters. The proposed method is motivated by and builds upon the widely used LASSO regression (Tibshirani, 1996) and the novel gene set based prediction method of Park et al. (2007), which averages expression within a gene set to reduce the dimensionality and complexity of prediction. Simulation studies and an application to cancer microarray data illustrate the competitive performance of the proposed method.

Because of the high dimensionality of large scale gene expression data, most existing clustering methods treat genes as independent for convenience of modeling and computation. In this thesis we explore model based clustering methods that explicitly account for gene interactions to improve sample clustering. One approach is to directly model all pairwise gene correlations, but this often introduces too many parameters and compromises model performance because of the small sample size typical of microarray data. Instead we adopt an intermediate approach: we divide genes into blocks based on publicly available gene molecular function information, model gene interactions within each block, and assume independence between blocks.
Specifically, we propose to model each block using a multivariate normal distribution with a structured covariance matrix based on principal components analysis. Overall, we fit all genes using a product multivariate normal mixture model for sample clustering. To select informative genes for improved sample clustering, we adopt a lasso penalized likelihood estimation approach. We develop an efficient covariance matrix computation algorithm based on principal components analysis, and a penalized EM algorithm for model estimation based on iterative coordinate descent. Through simulation studies and applications to public microarray data, we illustrate the competitive performance of the proposed clustering method.

We also propose a gene set based linear discriminant analysis method that integrates gene pathway information into cancer microarray classification and selects important genes through penalized likelihood methods. The proposed method uses a block diagonal regularized covariance matrix to model gene interactions and computes the regularized model estimates efficiently. We demonstrate the competitive performance of the proposed method through a simulation study and an application to public microarray data...
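
As a rough illustration of the block diagonal idea behind the gene set based linear discriminant analysis described above, the following Python sketch fits a discriminant rule whose pooled covariance is estimated separately within each gene block and treated as independent across blocks. It is not the thesis's algorithm: the function names `fit_block_lda` and `predict_block_lda`, the partition `blocks`, and the shrinkage of each block toward its diagonal (parameter `shrink`) are assumptions made for illustration, and the penalized gene selection step is omitted.

```python
import numpy as np

def fit_block_lda(X, y, blocks, shrink=0.1):
    """Fit LDA with a block diagonal, shrunken pooled covariance.

    X: (n_samples, n_genes) expression matrix
    y: (n_samples,) class labels
    blocks: list of index arrays, one per gene set (hypothetical partition)
    shrink: weight in [0, 1] pulling each block toward its diagonal
            (an assumed regularizer, not the one used in the thesis)
    """
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    # Center each sample by its own class mean to pool within-class scatter.
    centered = X - means[np.searchsorted(classes, y)]
    dof = max(len(y) - len(classes), 1)
    inv_blocks = []
    for idx in blocks:
        S = centered[:, idx].T @ centered[:, idx] / dof
        S = (1 - shrink) * S + shrink * np.diag(np.diag(S))
        # Only small per-block matrices are inverted, never the full one.
        inv_blocks.append(np.linalg.inv(S))
    return classes, means, priors, inv_blocks

def predict_block_lda(model, X, blocks):
    """Assign each sample to the class with the largest discriminant score."""
    classes, means, priors, inv_blocks = model
    scores = np.tile(np.log(priors), (X.shape[0], 1))
    for idx, S_inv in zip(blocks, inv_blocks):
        for k in range(len(classes)):
            d = X[:, idx] - means[k, idx]
            # Independence between blocks makes the Mahalanobis
            # distance a sum of per-block quadratic forms.
            scores[:, k] -= 0.5 * np.einsum("ij,jk,ik->i", d, S_inv, d)
    return classes[np.argmax(scores, axis=1)]

# Synthetic example: two gene blocks, only the first separates the classes.
rng = np.random.default_rng(0)
blocks = [np.arange(0, 3), np.arange(3, 6)]
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 6))
X[y == 1, :3] += 3.0
model = fit_block_lda(X, y, blocks)
pred = predict_block_lda(model, X, blocks)
```

Because the covariance is block diagonal, the cost of inversion scales with the largest block rather than with the total number of genes, which is the computational motivation for modeling interactions only within gene sets.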
Keywords/Search Tags:Gene, Method, Microarray, Prediction, Clustering, Model, Sample, Competitive performance