Detecting cis-regulatory modules by modeling correlated structures in genomic sequences

Posted on: 2007-08-24
Degree: Ph.D
Type: Dissertation
University: Harvard University
Candidate: Zhou, Qing
Full Text: PDF
GTID: 1440390005968696
Subject: Biology
Abstract/Summary:
Gene transcription is regulated by interactions between transcription factors (TFs) and their DNA binding sites. The common binding pattern of one TF is called a motif. The regulatory information for a eukaryotic gene is encoded in cis-regulatory modules (CRMs) composed of binding sites of multiple TFs. I propose a hierarchical mixture approach to model the cis-regulatory module structure by considering the co-localization of multiple transcription factor binding sites (TFBSs) to the same module. Based on the model, a de novo motif-module discovery algorithm, CisModule, is developed for Bayesian inference about module locations and within-module motif sites. I have applied this approach to the characterization and discovery of novel CRMs that drive gene expression in muscle development in Ciona savignyi.

Furthermore, evolutionary constraints among TFBSs in related species provide an independent piece of information for the identification of motifs. To combine these two sources of information, module structure and cross-species orthology, in de novo motif discovery, I develop a coupled hidden Markov model (c-HMM). In each species the hidden states indicate the locations of motif sites and modules, and the hidden states in different species are coupled through multiple alignment. Background nucleotides and TFBSs are assumed to follow different evolutionary models. Inference on this model is based on Markov chain Monte Carlo sampling of CRMs and their component motifs simultaneously from their joint posterior distribution in the sequence context of multiple species. This method has been tested on biological data sets in which known CRMs have been annotated. Significant improvement by this method over other module discovery and phylogenetic motif discovery methods is observed. Further applications of this method are illustrated through a case study of mouse tissue expression data.

In addition to the development of suitable models for the inference of CRMs, this dissertation also contributes significantly to the development of efficient computational algorithms. Specifically, recursive summation and backward sampling via dynamic programming are derived to allow efficient sampling from the conditional distributions of these models.
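The general idea of recursive summation followed by backward sampling can be sketched on an ordinary two-state hidden Markov model. This is only an illustration of the generic technique, not the dissertation's CisModule or c-HMM algorithm: the state labels (0 = background, 1 = motif site), the function names `forward_sums` and `backward_sample`, and all probabilities below are assumptions made up for the example.

```python
import random


def forward_sums(obs, init, trans, emit):
    """Recursive (forward) summation.

    Returns alphas, where alphas[t][k] is proportional to
    P(obs[0..t], state_t = k); each row is rescaled to sum to 1
    to avoid numerical underflow on long sequences.
    """
    n_states = len(init)
    prev = [init[k] * emit[k][obs[0]] for k in range(n_states)]
    total = sum(prev)
    prev = [p / total for p in prev]
    alphas = [prev]
    for t in range(1, len(obs)):
        cur = [
            emit[k][obs[t]] * sum(prev[j] * trans[j][k] for j in range(n_states))
            for k in range(n_states)
        ]
        total = sum(cur)
        cur = [c / total for c in cur]
        alphas.append(cur)
        prev = cur
    return alphas


def backward_sample(alphas, trans, rng):
    """Backward sampling: draw a full state path from its joint posterior.

    The last state is drawn from alphas[-1]; each earlier state is drawn
    with weight alphas[t][j] * trans[j][next_state].
    """
    n_states = len(trans)
    path = [0] * len(alphas)
    path[-1] = rng.choices(range(n_states), weights=alphas[-1])[0]
    for t in range(len(alphas) - 2, -1, -1):
        nxt = path[t + 1]
        weights = [alphas[t][j] * trans[j][nxt] for j in range(n_states)]
        path[t] = rng.choices(range(n_states), weights=weights)[0]
    return path


# Hypothetical toy parameters (not taken from the dissertation).
init = [0.9, 0.1]                       # P(state at first position)
trans = [[0.9, 0.1], [0.2, 0.8]]        # trans[j][k] = P(next = k | current = j)
emit = [
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # background
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},   # motif site
]

seq = "ATGGGGCAT"
alphas = forward_sums(seq, init, trans, emit)
path = backward_sample(alphas, trans, random.Random(0))
print(path)
```

Because the forward pass stores the summed probabilities once, each posterior draw of the whole path costs time linear in the sequence length, which is what makes this kind of step practical inside a Markov chain Monte Carlo sampler.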
Keywords/Search Tags:Model, Module, Binding sites, Cis-regulatory