
Improving Gene Structure Prediction By Combining Multiple Sources Of Evidence

Posted on: 2008-01-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Li
GTID: 1100360242464049
Subject: Genetics
Abstract/Summary:
The Human Genome Project (HGP) marks the entry of molecular biology into the "omics" era. To date, the genomes of approximately 2,000 organisms have been sequenced or are being sequenced. The first stage in interpreting and annotating genomic data is to list the protein-coding genes and determine the exact exon-intron structure of every gene.
Many sources can provide evidence for genome annotation, including expressed sequence tags (ESTs), homologous proteins, computational gene predictions and sequence conservation among closely related organisms. Evidence from multiple sources can be complementary, but it can also conflict. Although some model species have been annotated by manual curators, manual curation is time-consuming and costly, and it has been limited to the genomes of model species. Computational gene finding has therefore been used as the only practical way to produce an initial annotation, especially for most newly sequenced species. Computational gene prediction has made good progress in the last few years in terms of both methods and prediction accuracy, but the task remains a significant challenge, especially for eukaryotes, in which coding exons are usually separated by introns of varying length. Current gene predictors can produce a number of false positives when applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because there is no training set composed of abundant validated genes.
In this thesis, we present a score-based method for predicting eukaryotic gene structures by combining multiple pieces of evidence generated from a diverse set of sources. The evidence includes the predictions of four leading ab initio gene finders (Genscan, Augustus, Fgenesh and Geneid) and alignments to EST and protein databases.
First, the raw scores of the evidence are transformed by nonparametric estimation methods into probabilistic scores that reflect the likelihood that the evidence is correct. We tested four methods (empirical distribution, piecewise linear function, kernel density estimation and local polynomial regression) and found that local polynomial regression is the best method for score transformation. The evidence is then integrated and normalized using the Dempster-Shafer theory of evidence and a voting algorithm. Lastly, the normalized evidence is combined into a frame-consistent gene model by dynamic programming. Because dynamic programming is an unsupervised method, it can be used to predict genes in newly sequenced organisms.
Based on the models and algorithm described above, we designed a program named SCGPred (Score-based Combinational Gene Predictor). SCGPred is written in Perl and is open source under a GNU license. It can run in a supervised mode on previously well-studied genomes and in an unsupervised mode on novel genomes. In tests on three datasets composed of large DNA sequences, from human (chromosome 22 and the ENCODE sequence set) and from the novel genome of Ustilago maydis, SCGPred shows a significant improvement over the best of the ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, and that it is superior to other unsupervised methods. SCGPred can therefore serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes.
Besides protein-coding genes, there is a large class of genes that encode microRNAs. MicroRNAs, an abundant class of tiny non-coding RNAs, have emerged as negative regulators that cause translational repression or cleavage of target mRNAs through complementary base pairing in plants and animals.
By searching for short complementary sequences between transcription factor open reading frames and intergenic regions, and by considering RNA secondary structure and sequence conservation between the genomes of Arabidopsis and Oryza sativa, we detected 96 candidate Arabidopsis microRNAs. These candidates were predicted to target 102 transcription factor genes that fall into 28 transcription factor gene families, particularly DNA-binding transcription factor families. This suggests that microRNAs might be involved in complex transcriptional regulatory networks that specify individual cell types in plant development.
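The first step of the microRNA search described above, finding short complementary sequences between a transcription factor open reading frame and an intergenic region, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the window length `k` and the requirement of perfect complementarity are assumptions made for illustration, and the abstract does not state the length or mismatch settings that were actually used. The secondary-structure and conservation filters are not shown.

```python
from typing import Dict, List, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def complementary_sites(orf: str, intergenic: str, k: int = 20) -> List[Tuple[int, int]]:
    """Find (orf_pos, intergenic_pos) pairs of perfectly complementary k-mers.

    Hypothetical helper: k and exact matching are illustrative assumptions.
    """
    orf = orf.upper()
    intergenic = intergenic.upper()

    # Index every k-mer of the ORF by its start position.
    index: Dict[str, List[int]] = {}
    for i in range(len(orf) - k + 1):
        index.setdefault(orf[i:i + k], []).append(i)

    # An intergenic k-mer pairs with the ORF if its reverse complement
    # occurs in the ORF.
    hits: List[Tuple[int, int]] = []
    for j in range(len(intergenic) - k + 1):
        target = reverse_complement(intergenic[j:j + k])
        for i in index.get(target, []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences, not real Arabidopsis data.
    print(complementary_sites("ATGGCCAAGTTT", "TTAACTTGGCAA", k=8))
```

Indexing the ORF k-mers in a dictionary keeps the scan linear in the length of the intergenic sequence; a practical search would additionally tolerate a few mismatches and G:U pairs.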
Keywords/Search Tags: gene prediction, computational gene finding, genome annotation, combiner methods, supervised machine learning, unsupervised machine learning, microRNA, comparative genomics