Unsupervised and semi-supervised training methods for eukaryotic gene prediction

Posted on: 2009-12-31
Degree: Ph.D.
Type: Thesis
University: Georgia Institute of Technology
Candidate: Ter-Hovhannisyan, Vardges
Full Text: PDF
GTID: 2444390002493843
Subject: Biology
Abstract/Summary:
This thesis describes new methods for eukaryotic gene prediction. Current methods for deriving the model parameters of gene prediction algorithms rely on curated or experimentally validated sets of genes or gene elements. Assembling these training sets requires time and additional expert effort, especially for species in the initial stages of genome sequencing. Unsupervised training allows model parameters to be determined from anonymous genomic sequence. The practical applicability of unsupervised training is critical given the ever-growing rate of eukaryotic genome sequencing.

Three distinct training procedures are developed for diverse groups of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species, which possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns.

The results indicate that the unsupervised training methods perform well, both in comparison with other training methods and as estimated on the set of genes supported by EST-to-genome alignments.

The analysis of novel genomes led to interesting biological findings and showed that several fungal species are either over-annotated or under-annotated.
Keywords/Search Tags:Gene, Methods, Eukaryotic, Training, Species, Unsupervised