
Evaluation of the Automated Annotation Results of the Gluconobacter oxydans 621H Genome

Posted on: 2014-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: W Xia
GTID: 2250330401954758
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
With the advent of very high throughput sequencing technologies, new DNA sequences from microbes are being generated faster than they can be annotated. Annotation therefore relies inevitably on fully automatic pipelines, which may introduce and propagate inconsistent and incorrect gene annotations. To better understand a whole genome, we need to know how reliable its annotations are. An exact answer is impossible without laboratory experiments to verify the computational analyses. However, the growing body of knowledge about known proteins can help us infer the accuracy of annotations and suggest ways to identify the problems introduced by the current genome annotation process.

The organism chosen for investigation in this work is Gluconobacter oxydans 621H, which can partially oxidize carbohydrates through oxidative fermentation and can be used to produce vitamin C. Its chromosome consists of 2.7 million base pairs with 61% GC content. The annotations of this chromosome produced by three widely used pipelines (IMG, RAST and JCVI) were comparatively analyzed. The three annotation results show high agreement in protein gene calling, but the differences are also significant.

The most remarkable difference is the large number (670) of gene calls that are in partial agreement across the annotations, i.e., with the same stop location but different start locations. This is a common issue in bacterial genome annotation. We developed a set of empirical rules to classify BLAST results for inferring the true start sites of genes. With this homology-based strategy, supporting evidence was found for 247 such genes, allowing the correct start site to be chosen from the multiple options.

Together, the three annotations predict 2787 protein-coding genes, of which 1686 (60%) are entirely consistent, i.e., they have exactly the same genomic sequence locations in all three annotations. At the other end, there are 431 (15%) entirely inconsistent gene calls, which are absent from at least one annotation. By examining these predicted genes for protein homology and conserved domain features, we estimate that the lower limits on the numbers of genes missed by IMG, RAST and JCVI are 65, 15 and 18, respectively.
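The three agreement categories described above (entirely consistent, partial agreement, entirely inconsistent) can be illustrated with a minimal Python sketch. It is not the thesis's actual implementation: the function name classify_gene_calls, the (strand, start, stop) tuple layout, and the toy coordinates are all assumptions made for illustration. It also assumes each pipeline reports at most one gene call per strand and stop position.

```python
from collections import defaultdict

def classify_gene_calls(annotations):
    """Classify gene calls by their agreement across annotation pipelines.

    annotations: dict mapping a pipeline name to a list of
    (strand, start, stop) tuples (hypothetical layout for this sketch).
    Returns a dict mapping (strand, stop) to one of
    "consistent", "partial" or "inconsistent".
    """
    n_pipelines = len(annotations)

    # Calls that share a strand and stop position are treated as the
    # same candidate gene; record the start each pipeline proposed.
    starts_by_gene = defaultdict(dict)
    for pipeline, calls in annotations.items():
        for strand, start, stop in calls:
            starts_by_gene[(strand, stop)][pipeline] = start

    labels = {}
    for gene, starts in starts_by_gene.items():
        if len(starts) < n_pipelines:
            # Absent from at least one annotation.
            labels[gene] = "inconsistent"
        elif len(set(starts.values())) == 1:
            # Same start and stop in every annotation.
            labels[gene] = "consistent"
        else:
            # Same stop everywhere, but the start sites differ.
            labels[gene] = "partial"
    return labels

# Toy coordinates, invented purely for illustration.
toy = {
    "IMG":  [("+", 100, 400), ("+", 500, 900), ("-", 1500, 1000)],
    "RAST": [("+", 100, 400), ("+", 530, 900)],
    "JCVI": [("+", 100, 400), ("+", 500, 900), ("-", 1500, 1000)],
}
labels = classify_gene_calls(toy)
for gene in sorted(labels):
    print(gene, labels[gene])
```

Keying on strand and stop position follows the abstract's definition of partial agreement, in which calls share a stop location and differ only in their start. Counting the labels over real annotation tables would give totals comparable to the 1686, 670 and 431 reported above.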
Keywords/Search Tags:bacterial genome annotation, gene prediction, knowledge based gene finding, gene start site prediction, homology search