Protein sequence constraints

Posted on: 2010-03-16

Degree: Ph.D.

Type: Dissertation

University: University of Virginia

Candidate: Lavelle, Daniel Thor

GTID: 1440390002478328

Subject: Chemistry

Abstract/Summary:

To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino-acid words in proteins, we compared the frequencies of 4- and 5-amino-acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models. While the human proteome has many over-represented clumps, these words come from large protein families with biased compositions (e.g., Zn-fingers). In contrast, clump counts from non-redundant Pfam-AB sequences are well described by random models; from 1.9% (MC(0) model) to 0.1% (window-shuffled model) of 4-amino-acid word clumps are 2-fold over-represented. Likewise, using 5-residue clumps from a structural 10-letter alphabet, from 4.7% (MC(0) model) to 0.5% (window-shuffled model) of words are 2-fold over-represented in Pfam-AB. Using a false discovery rate q-value analysis, the number of exceptional 4- or 5-letter words in real proteins compared with random sequence models is similar to the number found when comparing words from one random model to another. Consensus over-represented words are not enriched in conserved regions of proteins, but 4- and 5-letter words are enriched in alpha-helical secondary structures (1.18-fold (i+1) to 1.61-fold (i+2)).

To test whether local secondary structure sequence preferences constrain protein sequences as a whole, we examined the 9-letter binary (Hydrophobic/Polar) word clumps found in a structurally distinct, non-homologous library of topologs from CATH version 3.1 (CATH-T) and compared them to counts generated by random models based only on amino-acid frequency data ("sequence-only") or amino-acid frequencies in secondary structures ("structure-informed"). Statistically exceptional 9-letter binary (H/P) clumps were identified by q-value false discovery rate analysis.
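As a rough illustration of what counting 9-letter binary (H/P) word clumps involves, the sketch below encodes a sequence in a two-letter alphabet and counts clumps of each k-letter word. It rests on two assumptions that the abstract does not confirm: the hydrophobic residue set is an illustrative choice, and a clump is taken to be a maximal run of overlapping occurrences of the same word, counted once. The names `to_hp` and `count_clumps` are hypothetical and are not taken from the dissertation.

```python
from collections import Counter

# Assumption: illustrative hydrophobic set; the abstract does not say
# which residues were classed as hydrophobic.
HYDROPHOBIC = set("AVLIMFWC")


def to_hp(seq):
    """Map an amino-acid sequence onto the binary H/P alphabet."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in seq.upper())


def count_clumps(seq, k):
    """Count clumps of every k-letter word in seq.

    Assumed definition: occurrences of the same word that overlap each
    other form one clump, so each such run adds a single count.
    """
    counts = Counter()
    last_start = {}  # start position of the most recent occurrence per word
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        prev = last_start.get(word)
        # A new clump starts only if this occurrence does not overlap
        # the previous occurrence of the same word.
        if prev is None or i - prev >= k:
            counts[word] += 1
        last_start[word] = i
    return counts


print(to_hp("MKVLA"))                    # HPHHH
print(count_clumps("HHHHH", 2)["HH"])    # 1 (four overlapping hits, one clump)
print(2 ** 9)                            # 512 possible 9-letter binary words
```

Counting clumps rather than raw occurrences keeps self-overlapping words such as a run of H's from being counted several times for what is effectively one event.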
Only 12% and 14.5% of the 512 possible words in CATH-T proteins are significantly over- and under-represented, respectively, when compared to window-shuffled random sequences. However, when word clumps associated with alpha-helices, beta-strands, and loops are examined separately, an MC(2) model that preserved tri-residue frequencies for alpha-helical regions fit the 9-residue clump data best. Most 9-letter words can be well described by short tri-residue frequencies. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.
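Two of the random sequence models named in the abstract, MC(0) and window shuffling, can be sketched as follows. This is a minimal sketch under stated assumptions, not the dissertation's implementation: MC(0) is read as an order-0 Markov chain that draws each residue independently from the overall composition, window shuffling is read as permuting residues within consecutive fixed-length windows, and the window length is a free parameter because the abstract does not give one. The function names are hypothetical.

```python
import random
from collections import Counter


def mc0_sample(seq, rng):
    """Order-0 Markov model: draw each residue independently from the
    overall residue composition of seq."""
    freq = Counter(seq)
    letters = sorted(freq)
    weights = [freq[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(seq)))


def window_shuffle(seq, window, rng):
    """Permute residues within consecutive windows, so that local
    composition is preserved while residue order is randomized.
    Assumption: `window` is illustrative; the abstract gives no length."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)


def over_represented(observed, expected, fold=2.0):
    """Words whose observed count is at least `fold` times the expected
    count. Simplification: words with no positive expected count are
    skipped rather than treated as infinitely over-represented."""
    return sorted(
        w for w, n in observed.items()
        if expected.get(w, 0.0) > 0 and n >= fold * expected[w]
    )


rng = random.Random(0)  # fixed seed so a run is repeatable
seq = "ACDEFGHIKLMNPQRSTVWY" * 3
shuffled = window_shuffle(seq, 20, rng)
sampled = mc0_sample(seq, rng)
```

In use, word counts from the real sequences would serve as `observed`, and counts averaged over many sampled or shuffled sequences would serve as `expected`. Window shuffling keeps local composition intact, which is one plausible reason it leaves fewer words looking over-represented than MC(0) does.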

Keywords/Search Tags:

Protein, Sequence, Words, Random, Secondary, Amino-acid

Related items

1	The Relationship Between The Amino Acid Residues Of Context And The Secondary Structure Of Protein Of Target Sequence
2	Investigating the sequence patterns in the secondary structure of proteins
3	Distributed Representation Of Amino Acids And Applications To Protein Sequence Analysis
4	A Study On The Protein Secondary Structure Prediction And The Connection Between Protein Secondary Structure And Its 3D Structure
5	Amino Acid Sequence Characterization, Features Selection And Its Application
6	Prediction Of The Amino Acid Sequences Critical For Regulating Protein Phase Separation
7	Protein Secondary Structure Prediction Based On The Hidden Markov Model
8	A Study On Protein-Protein Interaction Prediction Based On CGR And Random Forests
9	Protein Secondary Structure Prediction Question Research Based On Neural Network
10	Modeling Proteins For Their Ground State Conformations And Their Secondary Structure Predictions