GARBASE: a database of wrongly annotated proteins

Posted on: 2011-10-05
Degree: M.S.
Type: Thesis
University: University of Nebraska at Omaha
Candidate: Pandey, Sanjit
Full Text: PDF
GTID: 2448390002450634
Subject: Bioinformatics
Abstract/Summary:
One of the many problems in publicly available sequence databases is the presence of wrongly annotated genes. These sequences and their associated annotations are used by computational methods that predict genes in newly sequenced genomes. Such gene prediction is based on homology to previously annotated genes. Because genes that were wrongly annotated in the past are also supported by homology, wrong annotations propagate continually and affect any homology-based annotation of a newly sequenced genome. The objective of this project is to establish a balancing database of peptide sequences that have been called proteins, or parts of proteins, but have been identified or reported not to be true proteins. Here we report the development of a computational approach to collect evidence that can be used to determine a confidence score for the annotation of a protein. Using this framework and two biological properties of proteins, namely the presence of conserved domains and gene order conservation, we analyzed 85,259 proteins from 26 Mycobacterium genomes. The results of this analysis populate a prototype database (GARBASE), which consists of 19,484 proteins that are potentially annotated incorrectly. Additionally, work is underway to populate this database with results from the analysis of all genomes available in public repositories such as GenBank. This will allow GARBASE to be a useful resource when integrated into an automated genome annotation pipeline.
Keywords: Protein annotation, database, conserved domain, gene order, high performance computing.
Keywords/Search Tags: Database, Wrongly annotated, Proteins, Annotation
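
The abstract describes collecting evidence (conserved domains and gene order conservation) to derive a confidence score for each protein annotation, but it does not give the scoring scheme. The following is only a minimal sketch of how two such evidence types might be combined. The `Evidence` fields, the equal weights, and the 0.5 threshold are all hypothetical choices for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Evidence:
    """Hypothetical per-protein evidence record (field names are assumptions)."""
    protein_id: str
    has_conserved_domain: bool   # e.g. a hit from a conserved-domain search
    gene_order_conserved: bool   # e.g. neighbors preserved in related genomes


def confidence_score(ev: Evidence, w_domain: float = 0.5, w_order: float = 0.5) -> float:
    """Weighted sum of the two evidence flags; the weights are illustrative only."""
    return w_domain * ev.has_conserved_domain + w_order * ev.gene_order_conserved


def flag_suspect(evidence: Iterable[Evidence], threshold: float = 0.5) -> List[str]:
    """Return IDs of proteins whose score falls below an assumed threshold."""
    return [ev.protein_id for ev in evidence if confidence_score(ev) < threshold]


# Made-up example records, not data from the thesis.
records = [
    Evidence("protA", has_conserved_domain=True, gene_order_conserved=True),
    Evidence("protB", has_conserved_domain=True, gene_order_conserved=False),
    Evidence("protC", has_conserved_domain=False, gene_order_conserved=False),
]
suspects = flag_suspect(records)
```

With these assumed weights, only a protein lacking both kinds of support is flagged as a candidate for a database of potentially incorrect annotations. A real pipeline would likely weight and threshold the evidence differently.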