Prediction Of Protein Coding Genes And Promoters Based On Sequence Characteristics

Posted on: 2007-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Yang
Full Text: PDF
GTID: 2120360242461962
Subject: Computer application technology
Abstract/Summary:
Identifying protein-coding genes and their associated promoters is a crucial problem in genome analysis. As the volume of genomic sequence data has exploded, experimental methods alone can no longer cover the whole problem, so computational prediction of protein-coding genes and promoters has become important. This thesis presents a study of predicting protein-coding genes and their promoters from sequence characteristics.

First, a computational system for predicting promoters and their transcription start sites (TSSs) was implemented. A logit-linear model was designed to represent promoters; it integrates proximal promoter information with the sequence characteristics that promoter regions show at different distances from the TSS. Based on this promoter model, a system called ProKey was developed to locate TSSs and promoters in mammalian genomes. The system was evaluated on the whole human and mouse genomes. In a comparison of TSS prediction against two leading programs, DGSF and Eponine, the prediction accuracy of ProKey was significantly higher than that of both.

Second, a computational system for predicting protein-coding genes was implemented. By analyzing the sequence characteristics of protein-coding genes, the complicated problem of predicting several protein-coding genes in a eukaryotic DNA sequence containing multiple genes was decomposed into a series of sub-problems of decreasing complexity at several levels, including the gene level, the element level, and the feature level. On the basis of this decomposition, a multilevel model for predicting protein-coding genes was created. Based on the multilevel model, a dynamic programming algorithm was designed to search for optimal gene structures in DNA sequences, and a new program, GeneKey, was developed for predicting vertebrate protein-coding genes.
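To illustrate the kind of dynamic programming search described above, here is a deliberately simplified sketch. It is not GeneKey's algorithm or its multilevel model, which the abstract does not specify. It assumes a list of pre-scored candidate exons, each a hypothetical `(start, end, score)` tuple with an exclusive `end`, and picks the non-overlapping chain with the highest total score (classic weighted interval scheduling). The function name `best_exon_chain` and the example scores are made up for illustration.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Return (total_score, chain) for the best non-overlapping exon chain.

    candidates: iterable of (start, end, score) tuples, end exclusive.
    This is a toy stand-in for a gene-structure search, not GeneKey itself.
    """
    # Sort by end coordinate so every compatible predecessor comes earlier.
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]

    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, i)
        take_score = best[j][0] + score
        skip_score = best[i][0]
        if take_score > skip_score:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]


if __name__ == "__main__":
    # Hypothetical candidate exons with made-up scores.
    demo = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (150, 300, 3.0)]
    total, chain = best_exon_chain(demo)
    print(total, chain)
```

A real gene finder would also enforce reading-frame consistency, splice-site signals, and intron length constraints between consecutive exons; the sketch only shows the optimal-substructure idea that makes the search tractable.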
Tests on widely used datasets show that the prediction accuracy of GeneKey at the nucleotide, exon, and gene levels is higher than that of the well-known program GENSCAN.

Finally, the relationship between the C+G content of sequences and protein-coding genes was investigated. The results show that the sequence characteristics of protein-coding genes are correlated with the C+G content of the sequences. For CG-poor genes, prediction accuracy improves markedly when the model is trained on CG-poor genes.
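The C+G content referred to above is simply the fraction of C and G bases in a sequence. As a minimal sketch, the code below computes it and splits a set of sequences into CG-poor and CG-rich groups, as one might do before training separate models. The function names and the `threshold` value of 0.45 are assumptions for illustration; the abstract does not state which cutoff the thesis used.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        # No unambiguous bases (e.g. all N): report 0.0 rather than divide by zero.
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def split_by_gc(sequences, threshold=0.45):
    """Partition sequences into (cg_poor, cg_rich) lists.

    threshold is a hypothetical cutoff, not a value taken from the thesis.
    """
    cg_poor, cg_rich = [], []
    for seq in sequences:
        if gc_content(seq) < threshold:
            cg_poor.append(seq)
        else:
            cg_rich.append(seq)
    return cg_poor, cg_rich


if __name__ == "__main__":
    poor, rich = split_by_gc(["ATATATTA", "GCGCGGCC", "ATGCATGC"])
    print(len(poor), len(rich))
```

Ambiguous bases such as `N` are left out of the denominator so that gaps in an assembly do not drag the estimate toward zero.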
Keywords/Search Tags: genomic sequence, protein-coding gene, promoter, multilevel optimization