Research And Application Of Classification Algorithms In The Prediction Of Protein-Protein Interactions

Posted on: 2011-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: C K Xu
GTID: 2120360308957400
Subject: Computer software and theory
Abstract/Summary:
Proteins are the main executors of life activities, and these activities are carried out through protein-protein interactions. Only by studying protein function at a systems level can we fully reveal the essence of life activity and the molecular mechanisms of life phenomena. This need gave rise to proteomics, the large-scale, systematic study of the structures and functions of all proteins in a cell or biological tissue and of their relationships with other molecules. The study of protein-protein interaction (PPI) is therefore one of the most active topics in proteomics.

Because traditional experimental methods are laborious and often yield false positive and false negative results, computational methods are becoming increasingly important for predicting protein-protein interactions. This thesis focuses on applying machine-learning classification algorithms to predict PPI from the primary sequence of proteins. A central tenet of protein science holds that a protein's primary sequence specifies its structure and that its structure determines its function, so primary-sequence information alone could be sufficient to predict PPI. The approach is also widely applicable, because it requires only the primary structure of each protein. The main contents of this thesis are as follows:

(1) A novel PPI prediction method is proposed, based on an improved pseudo amino acid composition (PseAA) feature extraction algorithm. Because different protein functions are governed by different amino acid attributes, the amino acid properties related to PPI should be integrated to represent protein sequence information effectively and efficiently. First, a PseAA feature extraction method based on the Geary autocorrelation function is used to evaluate the relevance of each amino acid property to protein interaction. Then, according to the evaluation results, the relevant properties are selected and integrated by another novel PseAA feature extraction method based on the Minkowski function, and a random forest is adopted as the classifier for learning and prediction. Experimental results on the Helicobacter pylori PPI dataset indicate that the method achieves higher accuracy than traditional methods.

(2) A random forest classifier combined with n-Diad is constructed to predict protein-protein interactions. The n-Diad encoding is proposed for extracting information from protein sequences; it takes both synonymous mutation and the hydrophobic effect into account. Random forest is chosen as the classifier because of its high generalization ability. The performance of a prediction model also depends on the quality of the training dataset. Therefore, the interaction data are extracted from the Saccharomyces cerevisiae PPI dataset in the Database of Interacting Proteins (DIP), and four different criteria based on biological principles are adopted to construct non-interaction datasets so that the effects of these criteria can be evaluated. Because it integrates different kinds of biological evidence about non-interaction, the ScoNeg dataset should be more biologically significant, and the prediction model trained on it achieves the best performance, with high prediction accuracy.

(3) An improved K-Nearest Neighbor (KNN) classifier with Moran-PseAA feature extraction is proposed to predict protein-protein interactions. A novel Moran-PseAA feature extraction algorithm is used to encode protein sequences; it uses the Moran autocorrelation function to capture sequence-order information. KNN is adopted as the classification engine. Considering the characteristics of protein interaction, a novel distance function is proposed to measure the similarity between two protein pairs. PPI samples of Saccharomyces cerevisiae are then extracted from DIP to construct the dataset for training and testing, and the resulting classification model yields improved prediction accuracy.

(4) A novel PPI prediction method is proposed based on the properties of hot spot residues distributed on the protein surface and on the co-evolution of interacting proteins. Under the pressure of natural selection, interacting proteins co-evolve. Therefore, a co-Diad feature is used to represent the co-evolution information between two proteins. In addition, only a small number of hot spot residues at the protein interface account for the majority of the binding energy required for the physical interaction of two proteins. Hence the components of the feature vector that carry information about hot spot amino acids are related to PPI and useful for classification. Multi expression programming (MEP) is therefore adopted as the classifier, because it can select important and effective features while the classification model is being constructed. Hot spot residues can be of many types, and a single MEP classifier may extract only the features involving some of these types and lose the information about the others, so a number of MEP classifiers are integrated into an ensemble classification model. Tested on the Saccharomyces cerevisiae protein interaction dataset, this approach achieves good prediction results.
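
The idea in item (3), Moran autocorrelation features fed to a KNN classifier that compares protein pairs, can be illustrated with a minimal sketch. It is not the thesis's implementation. The abstract does not say which amino acid properties, lags, or distance function were used, so the Kyte-Doolittle hydropathy scale, the `max_lag` parameter, and the order-invariant `pair_distance` below are all assumptions chosen for illustration. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt

# Kyte-Doolittle hydropathy values. Using this particular property is an
# assumption; the abstract does not name the properties it selected.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moran_features(seq, max_lag=3, scale=HYDROPATHY):
    """Moran autocorrelation of one property at lags 1..max_lag.

    I(d) = [sum_i (P_i - m)(P_{i+d} - m) / (N - d)] / [sum_i (P_i - m)^2 / N]
    where m is the mean property value over the sequence.
    Raises KeyError for residues missing from the scale.
    """
    values = [scale[aa] for aa in seq]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    if len(set(values)) == 1:
        return [0.0] * max_lag  # constant profile carries no order signal
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    features = []
    for lag in range(1, max_lag + 1):
        cov = sum((values[i] - mean) * (values[i + lag] - mean)
                  for i in range(n - lag)) / (n - lag)
        features.append(cov / variance)
    return features


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_distance(pair1, pair2):
    """Hypothetical order-invariant distance between two protein pairs.

    Both ways of matching the proteins are tried and the closer one wins,
    so (A, B) and (B, A) are treated as the same pair.
    """
    (a, b), (c, d) = pair1, pair2
    return min(euclidean(a, c) + euclidean(b, d),
               euclidean(a, d) + euclidean(b, c))


def knn_predict(query, training, k=3):
    """Majority vote over the k training pairs nearest to the query.

    training is a list of ((features_a, features_b), label) items.
    """
    nearest = sorted(training,
                     key=lambda item: pair_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Taking the minimum over both pairings reflects one characteristic of interaction data: a protein pair has no inherent order, so a distance that changed when the two proteins were swapped would treat identical samples as different.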
Keywords/Search Tags: Protein-protein interactions, Pseudo-amino acid composition, Random forest, K-nearest neighbors, Multi expression programming