
Prediction-based genome annotation, domain assignment methods, and their applications in structural genomics

Posted on: 2005-02-02
Degree: Ph.D
Type: Thesis
University: Columbia University
Candidate: Liu, Jinfeng
Full Text: PDF
GTID: 2450390008980108
Subject: Health Sciences
Abstract/Summary:
The explosion of sequence information in the post-genomic era is steadily widening the gap between the number of protein sequences deposited in public databases and the number of proteins that have been characterized experimentally. Computational biology plays a central role in bridging this gap.

In this thesis, I analyzed more than sixty completely sequenced proteomes using various computational methods. The structural and functional annotations for each protein in these proteomes have been made publicly available through the database PEP. Systematic comparison of different proteomes yielded several interesting findings regarding evolution. For example, bacteria seemed to have smaller fractions of proteins responsible for communication than multi-cellular organisms. The sequence analysis on a genomic scale also led to the discovery of a class of proteins that have long regions of NO-Regular Secondary Structure (NORS) and appear to play significant functional roles. NORS proteins are much more abundant in eukaryotes, evolutionarily conserved, important in protein-protein interaction, and over-represented among proteins with regulatory and transcription-related functions.

I have also contributed to target selection for the Northeast Structural Genomics Consortium (NESG) and established an automatic target selection procedure for the consortium. My study revealed that structural genomics might have to target about 48% of all proteins and 52% of residues in the currently known proteomes. I estimated that it might be necessary to experimentally determine over 40,000 structures to minimally cover five eukaryotic proteomes. I also demonstrated that sequence clustering must begin with protein domains, and I developed two sequence-based domain assignment methods. CHOP, a homology-based method, was able to dissect 70% of proteins into domain-like fragments. Two results stood out from this comprehensive and still preliminary analysis of structural domains in entire proteomes: (1) over 70% of all dissected proteins contained more than one fragment, and (2) the number of CHOP fragments in a protein correlated linearly with the length of the protein. Since not all proteins could be dissected by CHOP into structural domains, I developed ChopNet, a new neural-network-based method that predicts domains from sequence. It correctly predicts the number of domains for 55% of all proteins and the domain boundary positions for 49% of two-domain proteins.
Keywords/Search Tags: Proteins, Structural, Domain, Methods, Sequence