Leveraging hidden correlations in high-dimensional biological data

Posted on: 2009-06-26
Degree: Ph.D.
Type: Thesis
University: Harvard University
Candidate: Aryee, Martin Joseph Ankrah
Full Text: PDF
GTID: 2448390005461237
Subject: Biostatistics
Abstract/Summary:
Recently developed high-throughput technologies have facilitated the study of biological systems at a genome-wide scale. As a result it has become clear that obtaining the "parts list" encoded by DNA was only the first step towards a comprehensive understanding of biology. The present challenge is to understand the complex relationships between the components, especially how their coordinated orchestration can break down and lead to disease. These relationships can manifest in the form of correlations between the measured variables and are "hidden" in the sense that they cannot be directly measured. The correlations contain valuable information that can be leveraged to substantially improve efficiency in estimation of subtle effects. This thesis presents two applications that take advantage of the correlations inherent in high-throughput protein interaction and gene expression time-course data.

Chapter 1 proposes an optimized high-throughput computational and experimental strategy for identifying protein interactions. The strategy is based on the fact that protein binding is mediated by interactions between functional subunits. Correlations between these subunits can be estimated from the existing data to learn some of the "rules" that govern protein binding. We then use this information to make predictions about the likelihood of binding between the remaining protein pairs and prioritize them for screening. We show that this strategy can considerably increase the rate of new interaction discovery.

Chapter 2 develops an improved statistical method to identify differentially expressed genes in time-course gene expression data. In addition to the inherent complexity of genomic expression data due to gene-gene dependencies, time-course data sets present a further level of complexity in the serial correlation between measurements of the same gene at different time points. We can, however, leverage this additional layer of information to improve our ability to detect differentially expressed genes.

Chapter 3 presents an analysis of genetic determinants of susceptibility to tuberculosis, using the tools developed in Chapter 2. We explore a microarray gene expression data set comparing response to Mycobacterium tuberculosis infection in macrophages derived from resistant and susceptible inbred mouse strains. The time-course analysis reveals strain-specific differences in the transcriptional programs that determine the outcome of infection.
Keywords/Search Tags: Data, Correlations, Time-course