Topology Recognition By Profile-HMM Based On A Novel Classification And Structure-Alignment

Posted on: 2009-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: W K Ren
GTID: 2120360242494164
Subject: Biophysics
Abstract/Summary:
As the number of protein sequences in biological macromolecule databases grows, developing new methods to extract structural information from amino acid sequences has become an important research topic in the post-genome era. Increasing evidence shows that the number of natural protein folds is limited, usually estimated at hundreds to thousands, which is far fewer than the number of degrees of freedom (DOF) available to proteins. Anfinsen's principle suggests that a protein's structure is largely determined by its sequence. As structural databases approach completeness, the problem of structural analysis becomes one of fold recognition, that is, finding the best-matching three-dimensional structural fold. Systematic study of these folds helps to uncover the principles of protein folding, to provide structural annotation for large protein databases, and to support precise protein structure prediction.
Currently, protein fold classification depends mostly on experts, and different databases follow different principles. SCOP classifies proteins by observation, based on homology; for some folds, however, it is difficult to construct a fold recognition model because the secondary structures of their members and the orientations of those structures differ. The topology level in CATH is classified by similarity scores from sequence and structure alignment, which do not directly show the similarity in protein secondary structure and its spatial arrangement. In fact, the fold type of a protein reflects the topology of the protein core, which covers three aspects of the protein's spatial structure: the secondary structure elements (SSEs), the relative arrangement of the SSEs along the sequence, and the overall path of the polypeptide chain (that is, its direction).
Based on current protein fold research and the conservation of protein domain topology, we reclassify protein domains according to three aspects: the arrangement, the orientation characteristics, and the connectivity of the protein SSEs.
Finally, a database named LIFCA was built, which forms the basis for protein fold recognition.
A significant aspect of fold recognition is the development of new algorithms. Current approaches fall into three main kinds: pairwise comparison of amino acid sequences (e.g. checking sequence similarity with Blast and Fasta), model construction based on multiple sequence alignment (e.g. the Profile HMM method), and classifiers (e.g. NN, SVM). Compared with pairwise comparison, an HMM can build a single unified model and extract the core of multiple homologous sequences, so it gives better recognition results for sequences that have no highly similar template in the known databases. In addition, although classifiers such as SVM can achieve higher accuracy, the profile HMM has some irreplaceable merits, such as a more uniform framework, retention of information about conserved sites, and detailed statistical analysis of the amino acids in the sequences. Moreover, a profile HMM sequence model can be obtained directly from a multiple alignment, which makes it more suitable for further analysis and research.
The main work of this thesis includes the following:
1. Establishing the LIFCA database based on the topologies of folding cores. We chose 2,406 protein sequences from Astral with sequence identity of 25% or lower. The mainly α, mainly β, and α/β structural classes are included. We then reclassified these protein domains based on the study of protein folding, that is, according to their SSE content, arrangement, orientation, and connections. This work laid the foundation for further research.
2. Structure-based sequence alignments within topologies. For each topology, a structure-based sequence alignment is conducted, and the differences within each topology are also studied. The multiple alignments from this step are used for model building.
3. Profile HMM library. There are 74 representative topologies in LIFCA that contain no fewer than 4 members, so 74 Profile HMM models are built in total.
Using the Astral 1.65 100% identity sequence database as the test dataset, the classification accuracy is 74.5%, while the false positive rate remains lower than that of other identification methods; the Profile HMM library performs better for most topologies.
In this thesis, both the dataset and the algorithm have been improved, and the hidden Markov model library built with this method achieves broader coverage and good accuracy. This is valuable for related research.
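The library-and-classify workflow summarised in the abstract can be illustrated with a deliberately simplified sketch. It is not the thesis implementation: instead of a full profile HMM with insert and delete states, it builds an ungapped position-specific log-odds profile per topology, keeps only topologies with at least 4 aligned members, and assigns a query to the best-scoring profile. The function names, the uniform background frequency, the pseudocount, the score threshold, and the toy alignments are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_profile(alignment, pseudocount=1.0):
    """Column-wise log-odds scores from equal-length, gap-free aligned sequences."""
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Best ungapped window score of the profile along the query sequence."""
    n = len(profile)
    best = -math.inf
    for start in range(len(sequence) - n + 1):
        # Unknown residue symbols contribute a neutral score of 0.0.
        window = sum(profile[i].get(sequence[start + i], 0.0) for i in range(n))
        best = max(best, window)
    return best


def build_library(families, min_members=4):
    """One profile per topology, keeping topologies with at least min_members."""
    return {
        name: build_profile(alignment)
        for name, alignment in families.items()
        if len(alignment) >= min_members
    }


def classify(library, sequence, threshold=0.0):
    """Name of the best-scoring topology, or None if nothing beats the threshold."""
    best_name, best_score = None, threshold
    for name, profile in library.items():
        s = score(profile, sequence)
        if s > best_score:
            best_name, best_score = name, s
    return best_name


def accuracy(library, labelled):
    """Fraction of (sequence, topology) pairs that are classified correctly."""
    correct = sum(1 for seq, label in labelled if classify(library, seq) == label)
    return correct / len(labelled)


# Hypothetical toy alignments; real input would be structure-based alignments.
toy_families = {
    "topo_A": ["ACDEF", "ACDEF", "ACDQF", "ACNEF"],
    "topo_B": ["WYWYW", "WYWYW", "WFWYW", "WYWHW"],
    "topo_C": ["GGGGG", "GGGGG"],  # fewer than 4 members, so no model is built
}
library = build_library(toy_families)
```

Rejecting queries whose best score does not exceed the threshold is one simple way to keep false positives low; a real profile HMM tool would instead report statistically calibrated scores.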
Keywords/Search Tags:Protein, topology recognition, topology, Hidden Markov Model, structure alignment