
Recognizing Short Coding Sequences Of Human Genes And Predicting Proteinase Cleavage Sites In Polyproteins Of Coronaviruses

Posted on: 2005-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: F Gao
Full Text: PDF
GTID: 2120360122987737
Subject: Biophysics
Abstract/Summary:
The volume of available genome data is growing exponentially as more and more genome sequencing projects are completed. Driven by this explosion of genome data, computational gene recognition programs have become critical for the automatic annotation of large amounts of uncharacterized DNA sequence. Since the early 1980s, there has been great progress in the development of computational gene-finding algorithms. Some problems, however, remain unsolved. Recognizing short genes in prokaryotes and short exons in eukaryotes is one of them. This paper assesses various algorithms, including those currently available and new ones proposed here, in order to find the best algorithm for this problem. Using the databases and a standard benchmark, 19 algorithms were evaluated. Of these, the Z curve methods with 69 and 189 parameters performed best on the databases constructed here. In addition to achieving the highest recognition accuracy, as confirmed by 10-fold cross-validation tests, the Z curve methods are computationally much simpler than the second-best method, the fifth-order Markov chain model, which uses 12,288 parameters.

Recently, we developed a coronavirus-specific gene-finding system, ZCURVE_CoV 1.0. In this paper, the system is further improved by taking the prediction of viral proteinase cleavage sites in polyproteins into account. Building on the traditional positional weight matrix method, trained on the peptides around cleavage sites, the present method also considers the conservation of the length and number of the non-structural proteins cleaved by the 3C-like proteinase and the papain-like proteinase, in order to reduce the false positive prediction rate. The improved system, ZCURVE_CoV 2.0, was run on each of the 24 completely sequenced coronavirus genomes in GenBank. All the non-structural proteins in the 24 genomes were accurately predicted. Compared with known annotations, the performance of the present method is satisfactory. The software ZCURVE_CoV 2.0 is freely available at http://tubic.tju.edu.cn/sars/.

Since the function of a protein is closely correlated with its subcellular location, predicting protein subcellular locations from amino acid sequences is of great importance in bioinformatics today. In this paper, a dataset of eukaryotic proteins with known locations was constructed from the SWISS-PROT database. Based on this dataset, some proteins (such as transport proteins or immune globulins) were found to localize to different sites at different stages of the cell cycle or under different conditions.
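
As an illustration of the Z curve representation named in the abstract, the following is a minimal Python sketch, not the thesis implementation. It computes only the nine phase-specific mononucleotide components (the three Z curve coordinates at each codon position); how the 69- and 189-parameter variants are assembled is not stated in the abstract and is not reproduced here. The function names and the toy sequence are hypothetical. The second helper only checks that 12,288 equals 3 × 4^5 × 4, which would be consistent with a codon-position-specific fifth-order Markov chain, although the abstract does not say how that count arises.

```python
from collections import Counter


def z_curve_phase_features(seq):
    """Return 9 phase-specific mononucleotide Z curve components.

    For each codon position (phase 0, 1, 2) the base frequencies
    a, c, g, t are mapped to the three Z curve coordinates.
    This is an illustrative sketch, not the thesis code.
    """
    seq = seq.upper()
    features = []
    for phase in range(3):
        # Every third base, starting at this codon position.
        counts = Counter(seq[phase::3])
        total = sum(counts[b] for b in "ACGT")
        if total == 0:
            raise ValueError("sequence too short or contains no A/C/G/T")
        a, c, g, t = (counts[b] / total for b in "ACGT")
        features.extend([
            (a + g) - (c + t),  # x: purines vs. pyrimidines
            (a + c) - (g + t),  # y: amino vs. keto bases
            (a + t) - (g + c),  # z: weak vs. strong hydrogen bonding
        ])
    return features


def periodic_markov_parameter_count(order, phases=3, alphabet=4):
    """Count transition probabilities in a phase-specific Markov chain.

    Assumption (not stated in the abstract): one table per codon
    position, each with alphabet**order contexts and `alphabet`
    possible next bases.
    """
    return phases * alphabet ** order * alphabet


if __name__ == "__main__":
    toy_sequence = "ATGGCC"  # hypothetical example, two codons
    print(z_curve_phase_features(toy_sequence))
    print(periodic_markov_parameter_count(5))
```

Under these assumptions, a feature vector of a few dozen to a few hundred Z curve components is far smaller than the table of transition probabilities a fifth-order Markov chain has to estimate, which matters most when the training sequences are short.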
Keywords/Search Tags: the Z curve, eukaryotic genomes, gene recognition, SARS-CoV genomes, polyprotein, cleavage sites, subcellular location