Recognition Of Protein-coding Genes And Sequence Analysis Of Prokaryotic Genomes

Posted on: 2006-05-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F B Guo
GTID: 1100360212989297
Subject: Biophysics
Abstract/Summary:
The rapidly increasing pace of bacterial genome-sequencing projects creates a need for automatic genome annotation. One of the most important annotation tasks is recognizing protein-coding genes in genomes. This dissertation describes new approaches for recognizing protein-coding genes in bacterial genomes using the Z curve method.

The first part introduces the development of bioinformatics and the progress of computational gene-finding algorithms. The Z curve theory, the basic tool used here for analyzing prokaryotic genomic sequences, is also presented in this part.

The second part proposes algorithms for recognizing protein-coding genes in prokaryotic genomes. The 2694 ORFs originally annotated as genes in the genome of Aeropyrum pernix can be categorized into three clusters (A, B, C) according to their nucleotide compositions at the three codon positions. A codingness index called the AZ score is defined, based on a clustering method, to recognize protein-coding genes in the A. pernix genome. The number of re-recognized protein-coding genes in the A. pernix genome is found to be 1610, which is significantly fewer than the 2694 in the original annotation and also much fewer than the 1841 in the RefSeq annotation curated by NCBI staff. Based on the Z curve theory of DNA sequences, an ab initio bacterial gene-finding program, ZCURVE 1.0, is developed. A comprehensive comparison with Glimmer 2.02 finds that ZCURVE 1.0 has more accurate gene start prediction, a lower additional prediction rate, and higher accuracy in predicting horizontally transferred genes. Joint application of both systems is shown to greatly improve gene-finding results. An ab initio virus and phage gene-finding program, ZCURVE_V 1.0, is also developed. Like ZCURVE 1.0, ZCURVE_V is based on the Z curve theory.
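For readers unfamiliar with the Z curve, the following is a minimal sketch of the standard Z curve transform, not code from ZCURVE or ZCURVE_V. It maps a DNA sequence to three cumulative components: x (purines A, G versus pyrimidines C, T), y (amino bases A, C versus keto bases G, T) and z (weak A, T versus strong G, C hydrogen bonding). The function name z_curve is a hypothetical helper introduced only for illustration.

```python
def z_curve(seq):
    """Return the cumulative Z curve points (x_n, y_n, z_n) for n = 1..len(seq)."""
    # Step contributed by each base to the three components:
    #   x_n = (A_n + G_n) - (C_n + T_n)   purine vs pyrimidine
    #   y_n = (A_n + C_n) - (G_n + T_n)   amino vs keto
    #   z_n = (A_n + T_n) - (G_n + C_n)   weak vs strong H-bond
    steps = {
        "A": (1, 1, 1),
        "C": (-1, 1, -1),
        "G": (1, -1, -1),
        "T": (-1, -1, 1),
    }
    x = y = z = 0
    points = []
    for base in seq.upper():
        # Assumption: ambiguous symbols such as N contribute nothing.
        dx, dy, dz = steps.get(base, (0, 0, 0))
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


# Toy sequence, not taken from any genome discussed in the abstract.
for n, point in enumerate(z_curve("ATGGCGTTTGGT"), start=1):
    print(n, point)
```

Because y_n equals (A_n - T_n) + (C_n - G_n), a region with an excess of G over C and of T over A drives the y component downward.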
ZCURVE_V emphasizes the global statistical features of protein-coding genes by taking into account the frequencies of bases at the three codon positions. In ZCURVE_V, only 33 parameters are used to characterize the coding sequences. For a fair comparison with GeneMark, the currently available software of similar function, a total of 30 viral genomes that have not been annotated by GeneMark are selected for testing. The average specificity of the two systems is well matched; however, the average sensitivity of ZCURVE_V for smaller viral genomes (< 100 kb), which make up the majority of viral genomes sequenced so far, is higher than that of GeneMark. In addition, a self-training gene start prediction method, GS-Finder, is developed.

The third part analyzes the genome sequences of some bacteria. Using the Z curve method, it is found that genes located on the two replication strands have separate base usage in Chlamydia muridarum. According to their positions in the 9-D space spanned by the variables u1 to u9, the K-means clustering algorithm can classify about 94% of the genes into the correct strands. Base usage and codon usage analyses show that genes on the leading strand have more G than C and more T than A, particularly at the third codon position. For genes on the lagging strand, the opposite holds. The y components of the Z curves for the complete chromosome sequences show that the excesses of G over C and of T over A are much greater in the above four genomes than in other bacterial genomes. The remarkable strand biases of G/C and T/A are proposed to be responsible for the appearance of separate base or codon usage in the four bacterial genomes. From the phylogenetic point of view, these four genomes group together. The base distribution patterns of DNA fragments in different regions of the P. aeruginosa genome are also analyzed in this part.
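The strand-classification idea above (each gene as a point in a 9-D space, split into two groups by K-means) can be sketched as follows. The abstract does not define u1 to u9, so this sketch assumes they are the usual phase-specific Z curve components, that is, the x, y and z values computed from base frequencies at each of the three codon positions. The helpers phase_vector and two_means are hypothetical names, the clustering is a plain two-cluster K-means written out for illustration, and the four ORFs are made-up toy sequences rather than C. muridarum genes.

```python
import math


def phase_vector(orf):
    """9-D vector: Z curve (x, y, z) from base frequencies at codon positions 1-3."""
    orf = orf.upper()
    vec = []
    for pos in range(3):
        bases = orf[pos::3]          # every third base, starting at this codon position
        n = len(bases) or 1          # avoid division by zero on very short input
        a, c, g, t = (bases.count(b) / n for b in "ACGT")
        vec += [(a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)]
    return vec


def two_means(points, iterations=20):
    """Plain K-means with k = 2; returns a 0/1 label for each point."""
    # Deterministic start (an assumption of this sketch): the first point
    # and the point farthest from it serve as the initial centers.
    centers = [points[0], max(points, key=lambda p: math.dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy ORFs that differ only at the third codon position:
# G/T-rich (leading-strand-like) versus C/A-rich (lagging-strand-like).
toy_orfs = ["ATGGCGTTTGGT", "ATAGCCTTAGGC", "ATGGCTTTGGGT", "ATCGCATTCGGA"]
print(two_means([phase_vector(orf) for orf in toy_orfs]))
```

On real data the cluster labels would still need to be matched to leading and lagging strands using the known replication origin and terminus; that step is outside this sketch.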
For P. aeruginosa, it is shown that 5565 protein-coding sequences, 17315 noncoding ORFs and 1104 intergenic sequences can, strikingly, be divided into several clusters according to their base distribution patterns, and almost all the protein-coding sequences fall into one of these clusters. The significantly different base frequencies at the three codon positions, which give rise to the differences between the base distribution patterns of the six reading frames of protein-coding sequences, account for this clustering phenomenon.
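To make the "six reading frames" concrete, here is a small sketch that tabulates base frequencies at the three codon positions for each of the six frames of a sequence (three shifts on the given strand, three on its reverse complement). It is an illustration under stated assumptions, not the procedure used in the dissertation; the function names and the frame labels "+1" to "-3" are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT symbols are left unchanged)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codon_position_freqs(seq):
    """Frequencies of A, C, G, T at codon positions 1, 2 and 3 (12 numbers)."""
    freqs = []
    for pos in range(3):
        bases = seq[pos::3]
        n = len(bases) or 1          # avoid division by zero on very short input
        freqs += [bases.count(b) / n for b in "ACGT"]
    return freqs


def six_frame_freqs(seq):
    """Map each frame label to its 12 codon-position base frequencies."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for shift in range(3):
        frames[f"+{shift + 1}"] = codon_position_freqs(seq[shift:])
        frames[f"-{shift + 1}"] = codon_position_freqs(rc[shift:])
    return frames


# Toy sequence, not taken from the P. aeruginosa genome.
for frame, freqs in six_frame_freqs("ATGGCGTTTGGTACCGAT").items():
    print(frame, [round(f, 2) for f in freqs])
```

For a true coding sequence read in its own frame, the three codon positions have clearly different base frequencies, so the same sequence read in a shifted or reverse-complement frame yields a different frequency pattern.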
Keywords/Search Tags: Z curve, bacterial and archaeal genomes, virus and phage genomes, gene recognition