
Statistical and computational methods for molecular signature analysis with applications

Posted on: 2006-03-21
Degree: Ph.D
Type: Dissertation
University: Boston University
Candidate: Subramanian, Aravind
Full Text: PDF
GTID: 1450390008973924
Subject: Biology
Abstract/Summary:
With the advent of DNA microarrays, it is now possible to study the expression pattern of an entire genome in a single experiment. Genome-wide expression analysis has become a mainstay of modern genomics over the past decade, as reflected in the publication of more than 1000 papers per year. The challenge is no longer obtaining gene expression profiles, but interpreting the resulting data to gain insight into biological mechanisms. A common approach is to focus on a handful of genes at the top and bottom of a differentially expressed gene list and to look for telltale clues among these highest-scoring genes. This approach has serious limitations. No individual gene may meet the threshold of statistical significance, because the relevant biological differences are small compared with the inherent noise in mRNA measurement on current microarrays. Alternatively, one may be left with a long list of statistically significant genes that share no unifying theme. To address these problems, we introduced a statistical methodology called Gene Set Enrichment Analysis (GSEA), which determines whether a given gene set is significantly enriched in a list of gene markers ranked by their correlation with a phenotype of interest. The method uses a random-walk statistic to evaluate enrichment and permutation procedures to assess statistical significance. To apply this method, we curated a large database of gene sets based on prior biological knowledge, transcriptional profiles from genetic and chemical perturbations, conserved sequence motifs, and computational clustering of previous expression data. We provide a mathematical description of the GSEA algorithm and demonstrate its utility through several biological applications. Gene-set-based analysis of expression datasets revealed novel findings that were not apparent from traditional single-gene analysis.
We demonstrate the broad applicability of GSEA by analyzing datasets from complex disorders such as diabetes and oncologic datasets on leukemia, p53, and lung cancer, and we present a genomic-signature-based strategy for using animal models to probe human disease. We have created a software implementation and a catalog of gene sets that can be used for GSEA; both are freely available at http://www.broad.mit.edu/GSEA.
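The random-walk statistic and permutation test described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: it assumes an unweighted running sum (each gene in the set counts equally, regardless of its correlation with the phenotype) and assesses significance by drawing random gene sets of the same size, which is only one of several possible permutation schemes. The function names `enrichment_score` and `permutation_p_value` are hypothetical and chosen for illustration.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Return the largest deviation from zero of an unweighted running sum.

    Walking down the ranked list, the sum rises when a gene belongs to the
    set and falls otherwise; the steps are scaled so the walk ends at zero.
    """
    hits = set(gene_set) & set(ranked_genes)
    n, n_hit = len(ranked_genes), len(hits)
    if n_hit == 0 or n_hit == n:
        raise ValueError("gene set must overlap the ranked list without covering it")
    step_up = 1.0 / n_hit            # increment for a gene in the set
    step_down = 1.0 / (n - n_hit)    # decrement for a gene outside the set
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += step_up if gene in hits else -step_down
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Estimate a two-sided empirical p-value from random same-size gene sets.

    Assumption: random gene sets stand in for the permutation procedure;
    permuting phenotype labels would require the underlying expression data.
    """
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = len(set(gene_set) & set(ranked_genes))
    as_extreme = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, size)
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (as_extreme + 1) / (n_perm + 1)
```

Here `ranked_genes` is a list of gene identifiers ordered by correlation with the phenotype, and `gene_set` is any collection of identifiers. A set concentrated at the top of the list yields a score near 1, and a set concentrated at the bottom yields a score near -1.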
Keywords/Search Tags: Gene, GSEA, Statistical, Expression