
Statistical models for protein sequence analysis

Posted on: 2004-10-14
Degree: Ph.D
Type: Thesis
University: University of Michigan
Candidate: Qian, Bin
Full Text: PDF
GTID: 2460390011467029
Subject: Biophysics
Abstract/Summary:
Rigorous computational methods are needed to unleash the power hidden in the DNA and protein sequences generated by the various genome projects. This thesis deals with the development and refinement of statistical models for protein sequence analysis.

The affine gap penalties generally used in pairwise sequence alignment represent the relative ease of extending a gap compared with opening one, but they are obvious oversimplifications of the real insertion and deletion processes that occurred during sequence evolution. To improve the efficiency of sequence alignment and to obtain a better understanding of protein evolution, we extracted the probability of gap occurrence and the resulting gap length distribution in distantly related proteins (sequence identity less than 25%) using alignments based on their structures. We observed a gap length distribution that can be fitted with a multi-exponential function with four distinct components. The results suggest new approaches to modeling insertions and deletions in sequence alignments.

Because most score functions used in pairwise alignment are designed to find homologs in the various databases rather than to generate accurate alignments between known homologs, we optimized a score function for the purpose of generating accurate alignments, evaluated by the percentage of correctly aligned residues compared with structurally aligned gold standards.

We also designed a phylogenetic-tree-based hidden Markov model (T-HMM) to build profiles for protein families. Profile hidden Markov models excel at capturing the common statistical features of a group of biological sequences, but they ignore the evolutionary relationships between the sequences. We introduced a method to incorporate phylogenetic information directly into hidden Markov models, and demonstrated that the resulting model performs better than most current multiple-sequence-based methods for finding distant homologs. We also used this method to generate common features of G-protein coupled receptors (GPCRs) based on either their ligand binding or G-protein coupling preference. The profile generated by T-HMM gives high accuracy in GPCR classification.
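
The abstract's remark that affine gap penalties oversimplify a multi-exponential gap length distribution can be illustrated with a short Python sketch. It treats a gap penalty as the negative log of the gap length likelihood: a single exponential then gives a cost that grows linearly with length (the affine case), while a sum of four exponentials gives a cost whose per-residue increment shrinks for long gaps. The function names, weights, rates, and penalty values below are hypothetical placeholders chosen for illustration; they are not the fitted values or the scoring scheme from the thesis.

```python
import math


def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Affine cost: gap_open for the first gap position, gap_extend for each further one.

    The default values are illustrative assumptions, not values from the thesis.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


def multi_exponential_penalty(length, weights, rates):
    """Negative log of an unnormalized multi-exponential gap length likelihood.

    likelihood(length) = sum_i weights[i] * exp(-rates[i] * length)
    With a single component this reduces to a cost that is linear in length.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    likelihood = sum(w * math.exp(-r * length) for w, r in zip(weights, rates))
    return -math.log(likelihood)


# Hypothetical four-component parameters (assumed for illustration only).
WEIGHTS = [0.6, 0.3, 0.08, 0.02]
RATES = [1.5, 0.5, 0.15, 0.04]

if __name__ == "__main__":
    for length in (1, 2, 5, 10, 20, 50):
        affine = affine_gap_penalty(length)
        multi = multi_exponential_penalty(length, WEIGHTS, RATES)
        print(f"length={length:3d}  affine={affine:6.2f}  multi-exponential={multi:6.2f}")
```

Under these assumed parameters, extending a long gap by one residue costs less than extending a short one, which a single pair of open and extend penalties cannot express.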
Keywords/Search Tags: Protein, Sequence, Models, Statistical, Hidden