
Recognition Of Protein-coding Genes And Genomic Analysis Of Prokaryotic And Eukaryotic Genomes

Posted on: 2005-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L L Chen
Full Text: PDF
GTID: 1100360122982211
Subject: Biophysics
Abstract/Summary:
The rapidly increasing pace of genome-sequencing projects for human and other model organisms has produced a large quantity of genome data, creating a great need for automatic genome annotation. One of the important tasks of annotation is to recognize protein-coding genes in prokaryotic and eukaryotic genomes. This dissertation describes new approaches, based on the Z curve method, for recognizing protein-coding genes in bacterial, archaeal, coronavirus, and eukaryotic genomes.
The first part introduces the development of bioinformatics and the progress of computational gene-finding algorithms. It also presents the Z curve theory, the basic tool used throughout this work to analyze prokaryotic and eukaryotic genomic sequences.
The second part proposes algorithms for recognizing protein-coding genes in prokaryotic genomes. Because the annotations of microbial genomes inevitably contain false-positive predictions, it is essential to determine which ORFs are coding and which are not. Starting from the known genes in the annotation file, we describe a method based on Z curve theory to recognize protein-coding genes among questionable ORFs. The average recognition accuracy over 57 bacterial and archaeal genomes is greater than 99%. A computer program, ZCURVE_C, has been developed, and a web service is provided. We also find that, for gene recognition in bacterial and archaeal genomes, genomic GC content is more important than phylogenetic lineage. Finally, a new program for recognizing genes in coronavirus genomes, especially suitable for SARS-CoV genomes, is presented. The improved system, ZCURVE_CoV 2.0, can also predict the cleavage sites of viral proteinases in coronavirus polyproteins.
The third part analyzes the genome structure of Arabidopsis thaliana and develops an ab initio eukaryotic gene recognition program.
Using a windowless technique based on the Z curve method, the isochore structure of the Arabidopsis thaliana genome has been explored. The position and size of a mitochondrial DNA insertion isochore have been precisely predicted. The amino acid usage and codon preference of its genes differ from those of genes in other regions. Furthermore, a new ab initio gene-finding program for eukaryotic organisms, Zcurve_E, is proposed in this part. The new algorithm captures global statistical features of protein-coding sequences by taking into account the frequencies of bases at the three codon positions. Consequently, it handles both typical and atypical cases well. Compared with other gene-finding software, the present program has the merits of simplicity, universality, and reliability. Applying Zcurve_E jointly with Genscan, which is probably the best software currently available for gene recognition in eukaryotic genomes, may yield better results than either program alone.
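To make the two ideas named above concrete, the following minimal Python sketch computes (1) the cumulative Z curve coordinates of a DNA sequence and (2) nine features derived from the base frequencies at the three codon positions. It uses the standard Z curve definitions (purine/pyrimidine, amino/keto, and weak/strong hydrogen-bond components). The function names `z_curve` and `codon_phase_features` are hypothetical and chosen only for illustration; this is not the code of ZCURVE_C, ZCURVE_CoV, or Zcurve_E, and the abstract does not say which further features or classifiers those programs use.

```python
def z_curve(seq):
    """Return the cumulative Z curve points (x_n, y_n, z_n) for n = 1..len(seq).

    x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
    y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
    z_n = (A_n + T_n) - (C_n + G_n)   weak vs. strong hydrogen bonding
    where A_n, C_n, G_n, T_n count each base in the first n positions.
    Assumption of this sketch: characters other than A, C, G, T leave
    the counts unchanged.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base in counts:
            counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


def codon_phase_features(orf):
    """Return nine features [x_1, y_1, z_1, x_2, y_2, z_2, x_3, y_3, z_3].

    For codon position i, the frequencies a_i, c_i, g_i, t_i of the four
    bases at that position are combined with the same three Z curve
    formulas used in z_curve, applied to frequencies instead of counts.
    """
    orf = orf.upper()
    features = []
    for phase in range(3):
        bases = orf[phase::3]  # every third base, starting at this codon position
        n = len(bases)
        if n == 0:
            raise ValueError("sequence must contain at least one full codon")
        a, c, g, t = (bases.count(b) / n for b in "ACGT")
        features.extend([(a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)])
    return features


if __name__ == "__main__":
    # Toy input, not real genome data.
    print(z_curve("ATGC"))
    print(codon_phase_features("ATGGCCAAATAA"))
```

A feature vector of this kind could then be passed to any standard classifier to separate coding from non-coding ORFs; the choice of classifier is left open here because the abstract does not specify one.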
Keywords/Search Tags:Z curve, Bacterial and archaeal genomes, Gene recognition, SARS-CoV, Genomes, Isochore, Eukaryotic genomes