
Predicting eukaryotic genes by integrating evidence

Posted on: 2010-10-30
Degree: Ph.D
Type: Thesis
University: University of Pennsylvania
Candidate: Liu, Qian
Full Text: PDF
GTID: 2440390002980975
Subject: Biology
Abstract/Summary:
Thanks to ever faster and cheaper sequencing technology, large and complex genomes can now be sequenced quickly and at low cost. Annotating the content of newly sequenced genomes, however, remains a significant challenge, and accurately identifying the exact exon-intron structures of protein-coding genes is especially difficult. Although computational gene prediction has improved steadily over the last two decades, prediction accuracy is still far from satisfactory, especially on complex genomes. As a variety of computational and high-throughput experimental gene evidence has become available, gene prediction could benefit greatly from combining these data. At the same time, it is a major challenge to develop general and flexible computational frameworks that can effectively incorporate different types of data and reconcile possibly inconsistent, even conflicting, evidence to infer consensus gene structures. To address this challenge, in this thesis we propose new machine learning models for eukaryotic gene prediction that integrate multiple sources of evidence, with the goal of improving prediction performance.

First, we developed Evigan, a eukaryotic gene predictor that integrates multiple sources of evidence. In Evigan, a Dynamic Bayesian Network (DBN) models the joint distribution of observed evidence sources and hidden consensus gene parses; parameter estimation and decoding are handled by the Expectation-Maximization algorithm and the Viterbi algorithm, respectively. Evigan can incorporate various types of evidence, including predictions from multiple gene finders, EST matches, protein hits, transcript alignments and splice site predictions, and can easily be extended to accommodate other types of evidence. Evigan was applied to annotate several species (Homo sapiens, Arabidopsis thaliana, Plasmodium vivax and Caenorhabditis elegans). The experiments show that Evigan outperforms each individual evidence source used as input, and that it is a flexible framework for integrating multiple sources of evidence.

We then explored how additional information, as it becomes available, can be used to further improve prediction performance in a simple post-processing step following initial gene prediction. Specifically, Evigan is extended to produce, in an initial prediction pass, the K-best candidate gene models for each gene locus. A reranker then reranks the candidate gene models using additional evidence, such as comparison with putative orthologous genes identified in a closely related reference species. The reranker takes as features conservation of splice site position, sequence composition and co-occurrence of signal peptides, and its parameters are estimated from data using a large margin learning algorithm. The reranker was applied to annotate Drosophila melanogaster with Drosophila pseudoobscura as the reference species, and it improved performance over Evigan's original best gene models.

In addition to predicting a single transcript per gene, we also sought to predict alternative splicing and alternative transcripts by further analyzing Evigan's K-best gene models. Treating the K-best gene models as hypotheses for alternative transcripts, we analyzed the relationship between the posterior probabilities that Evigan assigns to the K-best models and alternative splicing in human and Drosophila melanogaster. We found that these posterior probabilities can serve as a signal of whether a gene is alternatively spliced and of what its alternative transcripts are. Criteria inspired by these findings were applied to Toxoplasma gondii, producing a list of candidate genes with alternative splicing.
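
As a rough illustration of the decoding step described for Evigan above, the sketch below runs the Viterbi algorithm over a toy hidden-state model in which several evidence sources each report a label per position. It is not Evigan's actual model: the three-state label set, the transition table, the per-source accuracies and the names `finder_a`, `finder_b` and `est` are all assumptions made up for this example, and the sources are simply treated as independent given the hidden label. Parameter estimation by Expectation-Maximization is not shown; the parameters are fixed by hand.

```python
import math

# Hypothetical hidden labels; a real gene model has many more states.
STATES = ["intergenic", "exon", "intron"]

# Hypothetical transition probabilities between hidden labels.
TRANS = {
    "intergenic": {"intergenic": 0.8, "exon": 0.2, "intron": 0.0},
    "exon": {"intergenic": 0.1, "exon": 0.7, "intron": 0.2},
    "intron": {"intergenic": 0.0, "exon": 0.2, "intron": 0.8},
}
START = {"intergenic": 1.0, "exon": 0.0, "intron": 0.0}

# Hypothetical accuracies: P(source reports the true label).
ACCURACY = {"finder_a": 0.8, "finder_b": 0.7, "est": 0.9}


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def emission_logp(state, observed):
    """Sum log P(report | state) over sources, assumed independent given state."""
    total = 0.0
    for source, label in observed.items():
        acc = ACCURACY[source]
        wrong = (1.0 - acc) / (len(STATES) - 1)  # spread errors evenly
        total += log(acc if label == state else wrong)
    return total


def viterbi(observations):
    """Return the most probable hidden label sequence for the observations."""
    scores = {s: log(START[s]) + emission_logp(s, observations[0]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + log(TRANS[p][s]))
            new_scores[s] = scores[prev] + log(TRANS[prev][s]) + emission_logp(s, obs)
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy input: one dict of per-source reports per position.
observations = [
    {"finder_a": "intergenic", "finder_b": "intergenic", "est": "intergenic"},
    {"finder_a": "exon", "finder_b": "exon", "est": "exon"},
    {"finder_a": "exon", "finder_b": "intergenic", "est": "exon"},  # sources disagree
    {"finder_a": "intron", "finder_b": "intron", "est": "intron"},
    {"finder_a": "exon", "finder_b": "exon", "est": "exon"},
    {"finder_a": "intergenic", "finder_b": "intergenic", "est": "intergenic"},
]
path = viterbi(observations)
print(path)
```

At the third position the sources conflict, and the decoder resolves the conflict by weighing each source's assumed accuracy together with the transition structure. In a learned model, the hand-set accuracies would be replaced by parameters estimated from data.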
Keywords/Search Tags:Gene, Evidence, Alternative splicing, Prediction, Evigan, Eukaryotic