Evaluation Of Gene Structure Prediction Programs And Prediction Of Translation Initiation Sites

Posted on: 2008-10-13; Degree: Master; Type: Thesis
Country: China; Candidate: C Ma; Full Text: PDF
GTID: 2120360272468390; Subject: Bio-IT
Abstract/Summary:
Computational gene structure prediction, which is valuable for finding new genes and understanding the composition of genomes, plays an important role in many genome projects. For eukaryotic gene structures, many prediction programs have achieved remarkable accuracy on widely used datasets. However, a large number of new protein-coding genes have been experimentally validated since these datasets were constructed, and statistics of gene structure features computed on the newer genes differ significantly from previously reported results. Re-evaluating the accuracy of gene structure prediction programs has therefore become a widely shared concern in the field of computational gene prediction. This thesis presents a genome-wide prediction and analysis of human protein-coding genes.

We present a comprehensive evaluation of several representative gene prediction programs on a new dataset (BEN). The results show that prediction accuracy is significantly lower than previously reported. In addition, accuracy is relatively low for gene sequences with low C+G content and for gene structure features such as long introns, short exons, and translation initiation sites. Further analysis of the relationship between exon-level prediction accuracy and exon length shows that, for very short exons (<25 bp), accuracy is significantly higher for exons whose length is a multiple of three than for those whose length is not.

To address the weak detection of translation initiation sites (TISs), a computational program, TISKey, was implemented. The features widely used for predicting TISs are analyzed further, and it is found that some features of TISs and non-TISs depend heavily on the C+G content of the sequence around the AUG codon, and that some features differ markedly between non-TISs located in untranslated regions and those located in coding regions when the different reading frames are considered. Based on these findings, a strategy of using multiple support vector machines to make full use of this information is proposed and implemented in TISKey. Tests on a widely used dataset show that TISKey achieves better prediction accuracy. TISKey can be accessed via http://bioinfo.hust.edu.cn.
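
The dependence of TIS features on local C+G content and reading frame can be illustrated with a small sketch. It scans a sequence for candidate start codons, measures the C+G fraction of the flanking window, and assigns each candidate to a C+G bin that could be used to select one of several classifiers. The abstract does not describe how TISKey partitions its support vector machines, so the binning scheme, the thresholds (0.45 and 0.55), the window size, and all function names here are illustrative assumptions rather than the thesis's actual design.

```python
from typing import Iterator, NamedTuple


class Candidate(NamedTuple):
    position: int      # 0-based index of the A in the candidate ATG
    frame: int         # position modulo 3, relative to the sequence start
    gc_content: float  # C+G fraction of the flanking window
    gc_bin: str        # hypothetical label for choosing a classifier


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_bin(gc: float, low: float = 0.45, high: float = 0.55) -> str:
    """Map a C+G fraction to a coarse bin; the thresholds are assumptions."""
    if gc < low:
        return "low"
    if gc < high:
        return "medium"
    return "high"


def candidate_sites(seq: str, flank: int = 30) -> Iterator[Candidate]:
    """Yield every ATG/AUG in seq with its frame and local C+G content."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    start = seq.find("ATG")
    while start != -1:
        # Window covers the codon plus up to `flank` bases on each side.
        window = seq[max(0, start - flank): start + 3 + flank]
        gc = gc_fraction(window)
        yield Candidate(start, start % 3, gc, gc_bin(gc))
        start = seq.find("ATG", start + 1)


if __name__ == "__main__":
    for site in candidate_sites("CCGGAUGGCCAAAUAUGAAA", flank=5):
        print(site)
```

In a setup like this, each bin (optionally combined with the frame) would have its own trained classifier, so that a candidate is scored only by a model fitted to sequences with similar composition.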
Keywords/Search Tags: eukaryote, prediction of gene structure, prediction of protein coding region, translation initiation site, support vector machine