
Computational prediction of essential genes, and other applications of bioinformatics to genome annotation

Posted on: 2008-10-31
Degree: Ph.D
Type: Thesis
University: Yale University
Candidate: Seringhaus, Michael Rolf
Full Text: PDF
GTID: 2440390005959091
Subject: Bioinformatics
Abstract/Summary:
The large-scale identification and characterization of genes is an important challenge. Hundreds of genomes have now been sequenced; the next step is discerning which regions encode functional products. This is often achieved with a mix of computational and experimental techniques. Three such techniques (prediction of essential genes, large-scale transposon mutagenesis, and tiling microarrays) are the focus of the bioinformatics research presented here.

Essential genes are necessary for basic survival: disruption of even one is lethal to an organism. The ability to identify such genes in pathogens is therefore useful for drug design. Predicting essential genes in silico is particularly appealing because it circumvents expensive and difficult experimental screens. To date, most such prediction has concentrated on homology comparison to other species. This thesis presents a bioinformatics approach that uses characteristic features of a gene's sequence to estimate essentiality, and offers a promising way to identify antimicrobial drug targets in unstudied organisms.

A machine-learning classifier was trained on known essential genes in the model yeast Saccharomyces cerevisiae and applied to the closely related but relatively unstudied yeast Saccharomyces mikatae. The resulting predictions aligned well with homology-based estimates, and a subset was verified with in vivo knockouts in S. mikatae.

Next, the question of feature choice was addressed. Given an unstudied pathogen and the goal of identifying essential genes, are functional genomics assays worth performing, or will sequence data suffice? Three feature classes (sequence-based, sequence-derived, and experimental data) were assessed alone and in combination using a simple machine learner. The amalgamated feature set recovered the highest rate of true-positive predictions, whereas functional genomics data alone returned the highest ratio of true positives to false positives. The results suggest that experimental data are indeed valuable, but where they are unavailable, complementary sequence features perform nearly as well.

Also presented here are bioinformatics approaches to characterize transposon insertion bias on a genomic scale, and to optimize the performance of whole-genome tiling microarrays through the inclusion of mismatch oligonucleotides.

Together, these studies present an effective method to identify essential genes, and demonstrate the applicability of bioinformatics techniques to current issues in genome annotation.
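The abstract compares feature sets on two different measures: the rate of true-positive predictions and the ratio of true positives to false positives. The following is a minimal Python sketch of how those two measures can rank predictors differently. The function names and the toy labels are illustrative assumptions, not the thesis's code or data.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, fn


def true_positive_rate(y_true, y_pred):
    """Fraction of truly essential genes that were predicted essential."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def tp_fp_ratio(y_true, y_pred):
    """True positives per false positive among the predicted-essential genes."""
    tp, fp, _ = confusion_counts(y_true, y_pred)
    return tp / fp if fp else float("inf")


# Invented toy data: 1 = essential, 0 = non-essential (not thesis results).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
broad_calls = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # calls more genes essential
narrow_calls = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # calls fewer genes essential

for name, calls in [("broad", broad_calls), ("narrow", narrow_calls)]:
    print(name, true_positive_rate(labels, calls), tp_fp_ratio(labels, calls))
```

In this toy case the broader predictor recovers more of the essential genes, while the narrower one makes fewer false calls per correct call, which is the kind of trade-off the abstract describes between the amalgamated feature set and functional genomics data alone.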
Keywords/Search Tags: Genes, Bioinformatics