Integrating related data sets to improve inference in computational biology

Posted on: 2009-06-23
Degree: Ph.D.
Type: Dissertation
University: Harvard University
Candidate: Fan, Xiaodan
Full Text: PDF
GTID: 1448390005950204
Subject: Biology
Abstract/Summary:
Biological systems are generally too complex to be fully characterized by a snapshot from a single viewpoint or under a single condition. Modern high-throughput experimental techniques collect massive amounts of data that interrogate biological systems from various angles or under diverse conditions. Along with this trend, there is growing interest in statistical methods that integrate multiple sources of information to improve statistical inference and gain a deeper understanding of these systems. This dissertation presents data integration approaches for several computational biology problems. The main focus of this work is the development of hierarchical models, efficient Bayesian algorithms for computation, and systematic evaluation of their statistical power.

The first chapter introduces the trend toward data integration in computational biology, together with a brief literature review. The second chapter presents a Bayesian meta-analysis approach for integrating multiple microarray time-course data sets to detect cell cycle-regulated genes. A new Metropolis-Hastings algorithm was designed to achieve fast MCMC convergence when multiple data sets are pooled. A model comparison approach was used for classification and power evaluation. The third chapter provides another approach for detecting cell cycle-regulated genes, in which the problem is formulated as parallel model selection with a hierarchical structure. Reversible jump MCMC was used for dynamic model selection. A new procedure for proposal construction improved the mixing of reversible jump MCMC, which made it feasible for high-dimensional problems.

In the fourth chapter, we discuss several basic problems in comparative genomics studies, where multiple genomes are combined to detect functional elements. To guide future comparative genomics studies, the phylogenetic HMM was used to analyze the power of detecting conserved elements in various settings. We also present an empirical study of the conservation of transcription factor binding sites. It serves as a check of the conservation assumption and offers a clue for future integrated approaches to genome annotation.
Keywords/Search Tags: Data sets, Computational, Integrating, Approach