Similarity Analysis Of Protein Coding Genes And Its Impact On Gene Annotation In Prokaryotic Genomes

Posted on: 2016-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Chen
GTID: 2180330470950473
Subject: Microbiology
Abstract/Summary:
With the development of high-throughput sequencing technologies, duplicated genes have been found to be universal in genomes. Gene duplication not only increases the number of genes but also provides material for gene mutation and positive selection, and thereby opens possibilities for biological evolution. Understanding the biological significance and evolutionary mechanisms of duplicated genes is therefore particularly important. At present, duplicated genes have been studied less in prokaryotic genomes than in eukaryotic genomes, and multi-copied genes have rarely been studied at all. In this dissertation, duplicated genes in prokaryotic genomes are first analyzed further. On this basis, multi-copied genes and their functions are systematically studied for the first time, with the aim of providing reliable data and a theoretical basis for future research on prokaryote evolution.

In addition, gene annotation is an important topic in genome research. Protein-coding gene sequences serve as the training set in many gene annotation algorithms. However, many algorithms do not consider the similarity redundancy that duplicated genes and multi-copied genes introduce into protein-coding gene sequences. Redundancy in the data set is one of the key influencing factors in machine learning, and removing redundancy by sequence similarity has been widely applied in protein sequence prediction. Therefore, in this work we analyze the influence of sequence similarity on the results of gene annotation, with the aim of providing a reliable theoretical basis for gene prediction. The detailed contributions of this work can be summarized as follows.

I. Data sets were constructed from 98 prokaryotic genomes with different GC contents, downloaded from the RefSeq database. The CD-HIT program was used to identify similar sequences at a threshold of 80% and to remove redundant sequences. The multi-copied genes were then analyzed in all genomes. The statistical results show that the ratio of duplicated genes is 0%~16.49% and the ratio of multi-copied genes is 0%~15.93%. The results therefore show that duplicated genes and multi-copied genes are widespread in prokaryotic genomes. COG classification of the multi-copied genes shows that about 87% of them belong to category "L". Functional analysis of the multi-copied genes shows that about 71.4% of them are related to coding enzymes. This indicates that multi-copied genes are related to environmental adaptation.

II. To study the influence of similar gene sequences on gene prediction, we compare gene prediction accuracy, the number of reannotated genes, and the reliability of the predicted genes before and after redundancy removal, using the Z-curve algorithm and the RPGM algorithm. The statistical results show that all three aspects differ before and after redundancy removal. In addition, correlation analysis between the degree of sequence redundancy and the difference in each evaluation parameter before and after redundancy removal shows that the two factors are negatively correlated to varying degrees. The results therefore show that the influence of redundancy in protein-coding gene sequences on gene annotation cannot be ignored.
Keywords/Search Tags: Prokaryotic genomes, Duplicated genes, Multi-copied genes, Sequence redundancy, Gene prediction
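The redundancy ratios reported in contribution I could be computed from a CD-HIT cluster file roughly as sketched below. This is a minimal illustration, not the thesis's own procedure: the abstract does not define how "duplicated" and "multi-copied" genes are counted, so the `min_size` cutoff, the function names `parse_clstr` and `redundancy_ratio`, and the sample gene IDs are all assumptions made for the example. The only thing taken as given is the general layout of the `.clstr` file that CD-HIT writes next to its representative sequences.

```python
# Sketch: estimate how many genes fall into similarity clusters,
# given the text of a CD-HIT .clstr file.
# Assumption: a gene counts as "redundant" when its cluster has at least
# `min_size` members. The thesis abstract does not state its exact rule.

def parse_clstr(text):
    """Return a list of clusters, each a list of sequence IDs."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster header starts a fresh member list.
            clusters.append([])
        elif clusters and ">" in line:
            # Member lines look like: "0\t903nt, >gene_0001... *"
            seq_id = line.split(">", 1)[1].split("...", 1)[0]
            clusters[-1].append(seq_id)
    return clusters


def redundancy_ratio(clusters, min_size=2):
    """Fraction of all genes that sit in clusters of at least min_size."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    in_large = sum(len(c) for c in clusters if len(c) >= min_size)
    return in_large / total


# Hypothetical cluster file contents (gene IDs and identities are made up).
SAMPLE_CLSTR = "\n".join([
    ">Cluster 0",
    "0\t903nt, >gene_0001... *",
    "1\t900nt, >gene_0417... at +/92.10%",
    "2\t897nt, >gene_1290... at +/85.33%",
    ">Cluster 1",
    "0\t1200nt, >gene_0002... *",
    ">Cluster 2",
    "0\t650nt, >gene_0003... *",
    "1\t648nt, >gene_0999... at +/99.00%",
    ">Cluster 3",
    "0\t300nt, >gene_0004... *",
])

clusters = parse_clstr(SAMPLE_CLSTR)
print("clusters:", len(clusters))
print("ratio, clusters of size >= 2:", redundancy_ratio(clusters, min_size=2))
print("ratio, clusters of size >= 3:", redundancy_ratio(clusters, min_size=3))
```

Keeping the cutoff as a parameter makes it easy to try different definitions, for example treating clusters of two as duplicated genes and larger clusters as multi-copied genes, without changing the parser.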