Statistical techniques for examining gene regulation

Posted on: 2005-12-21
Degree: Ph.D.
Type: Thesis
University: Harvard University
Candidate: Jensen, Shane Tyler
GTID: 2458390008997074
Subject: Biology
Abstract/Summary:
Genes are often regulated in living cells by proteins called transcription factors (TFs) that bind directly to short segments of DNA in close proximity to certain target genes. These short segments have a conserved appearance, which is called a motif. The experimental determination of TF binding sites is expensive and time-consuming. Many motif-finding programs have been developed, but no program is clearly superior in all situations, making it difficult to judge which of the motifs predicted by these algorithms are biologically relevant.

This thesis provides a review of previous approaches to the problem of motif discovery. We derive a comprehensive scoring function based on a full Bayesian model, which can handle unknown site abundance, unknown motif width, and two-block motifs with variable-length gaps. In addition, this scoring function formulation enables us to objectively compare different predicted motifs and select the optimal ones, effectively combining the strengths of existing programs.

An algorithm, BioOptimizer, is proposed to optimize a scoring function, thereby reducing noise in the motif signal found by any motif-finding program. The accuracy of BioOptimizer, when used in conjunction with several existing programs, is shown to be superior to that of any of these motif-finding programs alone when evaluated by simulation studies and real-data applications in bacteria.

We then propose a Bayesian hierarchical clustering model for the common structure between a set of discovered motifs. This clustering model is implemented, using a Gibbs sampling strategy, on a dataset of 116 TF motifs, and several approaches to analyzing the clustering results are discussed. A uniform clustering prior is also considered and is compared to the Dirichlet process prior. Our clustering strategy is general enough to be appropriate and useful in a variety of other statistical settings.

Finally, our techniques for motif discovery and motif clustering are used in combination to predict co-regulated genes in the bacterium Bacillus subtilis. Sequences from several closely related species are used to discover motifs conserved by evolution, and these conserved motifs are then used to cluster genes together into putative co-regulated groups. This clustering is validated and examined in detail using several external measures of cell regulation.
Keywords/Search Tags: Clustering, Several