
Protein-coding gene structure prediction using generalized hidden Markov models

Posted on: 2004-04-09
Degree: Ph.D
Type: Dissertation
University: University of California, Santa Cruz
Candidate: Kulp, David Clayton
Full Text: PDF
GTID: 1458390011953221
Subject: Computer Science
Abstract/Summary:
This paper describes a computer method for predicting the exon-intron structures and protein-coding regions of genes in genomic DNA, along with its application to several model organisms. Expanding on earlier work applying linguistics and state machines to DNA analysis, the problem is introduced here using a novel generalized hidden Markov model that allows arbitrary-length symbols per model state. This model provides a simple representation of complex grammatical structure and reduces some of the parameterization and training burden of standard hidden Markov models. A key characteristic of the method [...] model topology. A computer program called "Genie" embodies the method described here. Employing mostly standard metrics for feature scoring, the basic gene-prediction method is shown to work better than other known methods, correctly identifying as much as 40% of exact coding sequences. An expanded method, which uses constraints on the set of possible outputs, allows for the incorporation of messenger RNA or protein sequence homology to boost gene-prediction sensitivity and specificity by approximately 10%. The results of whole-genome studies on several complete genome sequences are presented. Engineering details of the software design are discussed, including flexible run-time configurations and methods to reduce the running time from cubic to linear in the size of the input sequence.
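To make the idea of "arbitrary-length symbols per model state" concrete, here is a minimal Python sketch of Viterbi-style decoding for a generalized hidden Markov model, in which each state scores a whole substring rather than a single base. It is an illustration only and is not taken from Genie: the function name `ghmm_viterbi`, the two toy states `E` and `I`, the emission table, and the `max_len` cap on segment length are all assumptions made for this example. The cap is one simple way to bound the work per position; the abstract does not say which techniques the dissertation actually uses to reach linear running time.

```python
import math

NEG_INF = -math.inf


def ghmm_viterbi(seq, states, log_init, log_trans, log_segment, max_len):
    """Return (score, segments) for the best parse of seq.

    Each segment is (state, start, end) and is scored as a whole by
    log_segment(state, substring). max_len caps the segment length
    (an assumption of this sketch, used to bound the search).
    """
    n = len(seq)
    # best[i][s]: best log-score of a parse of seq[:i] whose last segment is in state s
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]

    for end in range(1, n + 1):
        for s in states:
            for length in range(1, min(max_len, end) + 1):
                start = end - length
                emit = log_segment(s, seq[start:end])
                if start == 0:
                    # First segment of the parse: use the initial distribution.
                    prev, cand = None, log_init.get(s, NEG_INF) + emit
                else:
                    # Pick the best predecessor state ending at `start`.
                    prev, prev_score = None, NEG_INF
                    for p in states:
                        sc = best[start][p] + log_trans.get((p, s), NEG_INF)
                        if sc > prev_score:
                            prev, prev_score = p, sc
                    cand = prev_score + emit
                if cand > best[end][s]:
                    best[end][s] = cand
                    back[end][s] = (start, prev)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s])
    score = best[n][state]
    if score == NEG_INF:
        return NEG_INF, []
    segments, end = [], n
    while end > 0:
        start, prev = back[end][state]
        segments.append((state, start, end))
        end, state = start, prev
    segments.reverse()
    return score, segments


# Toy model (hypothetical numbers): "E" favors G/C, "I" favors A/T.
EMIT = {
    "E": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
    "I": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4},
}


def log_segment(state, segment):
    # Score the whole substring at once; a real scorer could use any content model.
    return sum(math.log(EMIT[state][base]) for base in segment)


states = ["E", "I"]
log_init = {"E": math.log(0.5), "I": math.log(0.5)}
# No self-transitions: segment length is handled inside each state.
log_trans = {("E", "I"): 0.0, ("I", "E"): 0.0}

score, segments = ghmm_viterbi("GCGCATATGCGC", states, log_init, log_trans,
                               log_segment, max_len=6)
print(segments)
```

On this toy input the decoder splits the sequence into a G/C-rich `E` segment, an A/T-rich `I` segment, and a second `E` segment. Because the segment scorer is passed in as a function, the state graph can be changed without touching the scoring code, and vice versa.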
Keywords/Search Tags: Hidden Markov, Method, Model