Protein sequence constraints

Posted on: 2010-03-16
Degree: Ph.D.
Type: Dissertation
University: University of Virginia
Candidate: Lavelle, Daniel Thor
GTID: 1440390002478328
Subject: Chemistry
Abstract/Summary:
To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino-acid words in proteins, we compared the frequencies of 4- and 5-amino-acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models. While the human proteome has many over-represented clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, clump counts from non-redundant Pfam-AB sequences are well described by random models; from 1.9% (MC(0) model) to 0.1% (window shuffled model) of 4-amino-acid word clumps are 2-fold over-represented. Likewise, using 5-residue clumps from a structural 10-letter alphabet, from 4.7% (MC(0) model) to 0.5% (window shuffled model) of words are 2-fold over-represented in Pfam-AB. Using a false discovery rate q-value analysis, the number of exceptional 4- or 5-letter words in real proteins compared with random sequence models is similar to the number found when comparing words from one random model to another. Consensus over-represented words are not enriched in conserved regions of proteins, but 4- and 5-letter words are enriched in α-helical secondary structures (1.18-fold (i+1) to 1.61-fold (i+2)). To test whether local secondary structure sequence preferences constrain protein sequences as a whole, we examined the 9-letter binary (Hydrophobic/Polar) word clumps found in a structurally distinct, non-homologous library of topologs from CATH version 3.1 (CATH-T) and compared them to counts generated by random models based only on amino-acid frequency data ("sequence-only") or amino-acid frequencies in secondary structures ("structure-informed"). Statistically exceptional 9-letter binary (H/P) clumps were identified by q-value false discovery rate analysis.
Only 12% and 14.5% of the 512 possible words in CATH-T proteins are significantly over- and under-represented, respectively, when compared to window shuffled random sequences. However, when word clumps associated with alpha-helices, beta-strands, and loops are examined separately, an MC(2) model that preserved tri-residue frequencies for alpha-helical regions fit the 9-residue clump data best. Most 9-letter words can be well described by short tri-residue frequencies. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.
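The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's code, and it rests on several assumptions: a "clump" is taken to be a run of mutually overlapping occurrences of the same word, counted once; the window shuffled model is taken to permute residues inside consecutive fixed-size windows; and the hydrophobic residue set used for the binary H/P alphabet is a hypothetical choice. All function names, the window size, and the example sequence are illustrative only, and the q-value analysis is not reproduced here.

```python
import random
from collections import Counter

# Hypothetical hydrophobic set; the abstract does not list the residues used.
HYDROPHOBIC = set("AVLIMFWC")


def to_hp(seq):
    """Map an amino-acid sequence to a binary Hydrophobic/Polar string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in seq)


def clump_counts(seq, k):
    """Count k-letter words, counting each run of overlapping
    occurrences of the same word (a clump, as assumed here) once."""
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        prev = last_start.get(word)
        # A new clump starts only if this occurrence does not overlap
        # the previous occurrence of the same word.
        if prev is None or i - prev >= k:
            counts[word] += 1
        last_start[word] = i
    return counts


def window_shuffle(seq, window, rng):
    """Shuffle residues within consecutive windows, which keeps the
    local composition but destroys local word order."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)


def mean_shuffled_counts(seq, k, window, n_rep, rng):
    """Average clump counts over n_rep window-shuffled replicates."""
    total = Counter()
    for _ in range(n_rep):
        total.update(clump_counts(window_shuffle(seq, window, rng), k))
    return {word: n / n_rep for word, n in total.items()}


def fold_over_represented(observed, expected, fold=2.0):
    """Words whose observed count is at least `fold` times the expected
    count; words never seen in the random model are skipped."""
    return sorted(
        word for word, n in observed.items()
        if expected.get(word, 0) > 0 and n / expected[word] >= fold
    )


if __name__ == "__main__":
    rng = random.Random(0)
    protein = "MKVLAAGIVKEELLKRAGWSMKVLAAGIVKEELLKRAGWS"  # toy sequence
    observed = clump_counts(protein, 4)
    expected = mean_shuffled_counts(protein, 4, window=10, n_rep=200, rng=rng)
    print(fold_over_represented(observed, expected))
    print(clump_counts(to_hp(protein), 9).most_common(3))
```

With a binary alphabet there are 2^9 = 512 possible 9-letter words, which matches the word space quoted in the abstract. A real analysis would pool counts over a whole sequence library and assess significance with q-values rather than a fixed fold threshold alone.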
Keywords/Search Tags: Protein, Sequence, Words, Random, Secondary, Amino-acid