Sparse canonical correlation analysis

Posted on: 2009-03-11
Degree: Ph.D.
Type: Thesis
University: University of Toronto (Canada)
Candidate: Parkhomenko, Elena
Full Text: PDF
GTID: 2448390005954112
Subject: Genetics
Abstract/Summary:
Large-scale genomic studies of the association of gene expression with multiple phenotypic or genotypic measures may require the identification of complex multivariate relationships. In multivariate analysis, a common way to inspect the relationship between two sets of variables based on their correlation is Canonical Correlation Analysis (CCA), which determines linear combinations of all variables of each type such that the correlation between the two linear combinations is maximal. However, in high-dimensional data analysis, when the number of variables under consideration exceeds tens of thousands, linear combinations of the entire sets of features may lack biological plausibility and interpretability. In addition, insufficient sample size may lead to computational problems, inaccurate parameter estimates, and non-generalizable results. These problems may be solved by selecting sparse subsets of variables, i.e., obtaining sparse loadings in the linear combinations of variables of each type. However, available methods providing sparse solutions, such as Sparse Principal Component Analysis, consider each type of variable separately and focus on the correlation within each set of measurements rather than between sets. We introduce a new methodology, Sparse Canonical Correlation Analysis (SCCA), which examines the relationships of many variables of different types simultaneously. It addresses the problem of biological interpretability by providing sparse linear combinations that include only a small subset of variables. SCCA maximizes the correlation between the subsets of variables of different types while performing variable selection. In large-scale genomic studies, sparse solutions are also consistent with the belief that only a small proportion of genes are expressed under a given set of conditions. In this thesis I present the methodology for SCCA and evaluate its properties using simulated data.
I illustrate the practical use of SCCA by applying it to the study of natural variation in human gene expression, for which the data were provided as Problem 1 of the fifteenth Genetic Analysis Workshop (GAW15). I also present two extensions of SCCA: adaptive SCCA and modified adaptive SCCA. Their performance is evaluated and compared using simulated data, and adaptive SCCA is applied to the GAW15 data.
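The abstract describes SCCA only at a high level, so the following is an illustrative sketch of how sparse loadings for two sets of variables might be obtained. It is not the thesis's algorithm. It assumes one common approach: alternating power iterations on the sample cross-correlation matrix, with soft-thresholding to zero out small loadings. The function names `soft_threshold` and `sparse_cca_sketch`, the penalty parameters `lam_u` and `lam_v`, and the use of standardized columns in place of full within-set covariance matrices are all assumptions made for this example.

```python
import numpy as np


def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam; entries with |v_i| <= lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def sparse_cca_sketch(X, Y, lam_u=0.1, lam_v=0.1, n_iter=100, tol=1e-8):
    """Hypothetical sketch: first pair of sparse loading vectors for X (n x p) and Y (n x q).

    Assumes no column of X or Y is constant (columns are divided by their std).
    Returns (u, v, corr), where corr is the sample correlation between
    the two sparse linear combinations.
    """
    n = X.shape[0]
    # Standardize columns so that K below is the sample cross-correlation matrix.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    K = Xs.T @ Ys / n  # shape (p, q)

    # Start from the leading (dense) singular vectors of K.
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        u_new = soft_threshold(K @ v, lam_u)
        if not u_new.any():  # penalty too large: keep the previous iterate
            break
        u_new /= np.linalg.norm(u_new)

        v_new = soft_threshold(K.T @ u_new, lam_v)
        if not v_new.any():
            break
        v_new /= np.linalg.norm(v_new)

        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break

    corr = np.corrcoef(Xs @ u, Ys @ v)[0, 1]
    return u, v, corr
```

In this sketch, larger values of `lam_u` and `lam_v` yield fewer nonzero loadings, which corresponds to the variable selection the abstract describes. How the sparsity levels are chosen, and how the adaptive variants differ, is not stated in the abstract and is not modeled here.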
Keywords/Search Tags: SCCA, Sparse, Correlation, Linear combinations, Data, Variables