Computational Prediction Of Protein-protein Interactions And Hot Spot Residues In Protein Interfaces

Posted on:2011-10-30

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J F Xia

Full Text:PDF

GTID:1100360305966752

Subject:Bioinformatics

Abstract/Summary:

With the completion of genome sequencing for human and other species, biological research has gradually moved from the genomics era into the post-genomics era. As one of the most important fields of the post-genomics era, proteomics, which focuses on the study of all possible protein-protein interactions (PPIs) in a cell, has become a hot topic and a frontier of the life sciences. Studying PPIs helps us understand the essential mechanisms of life processes.

So far, a number of computational methods have been explored for the large-scale prediction of PPIs. Among them, protein sequence-based prediction methods have attracted much attention, because their accuracy and reliability do not depend on prior information about the protein pairs. Given the limited availability of three-dimensional protein structures and the rapid increase in the number of protein sequences, approaches that use amino acid sequence information alone to guide the discovery of PPIs are of particular interest. The current study therefore applies machine learning techniques, such as the support vector machine (SVM) and multiple classifier systems, to predict PPIs from sequences. In addition, we introduce an ensemble learning method based on SVMs to predict hot spot residues, which are observed to be crucial for preserving protein function and maintaining the stability of protein association. The main contributions of this thesis are as follows:

1. A sequence-based approach was proposed to predict PPIs by combining a new feature representation, the autocorrelation descriptor, with rotation forest. The autocorrelation descriptor accounts for the interactions between amino acid residues a certain distance apart in the sequence, so it takes the effect of each amino acid's local environment into account and makes it possible to discover patterns that run through entire sequences. The amino acid sequences were first translated into numerical values representing six physicochemical properties, and these numerical sequences were then converted into a series of fixed-length vectors by the autocorrelation descriptor. Finally, a rotation forest was constructed using these vectors as input. Rotation forest is a recently proposed robust ensemble method that simultaneously improves the accuracy of the individual classifiers and the diversity among them. Experimental results on Saccharomyces cerevisiae and Helicobacter pylori datasets show that the proposed approach outperforms previously published methods, which demonstrates its effectiveness and efficiency.

2. A method based on a novel local protein sequence descriptor and SVM was presented to infer PPIs. One particular feature of protein interactions is that they usually occur in discontinuous regions of the protein sequence, where distant residues are brought into spatial proximity by protein folding. In the current study, a novel local protein sequence descriptor was used to capture information about interactions between amino acids that are distant in the sequence. Each protein sequence was characterized by ten local descriptors of varying length and composition, so the method is capable of capturing multiple overlapping continuous and discontinuous binding patterns within a protein sequence. As expected, the experimental results show that our SVM-based predictive model with this encoding scheme is an important complementary method for PPI prediction.

3. A public meta predictor was constructed to infer PPIs using only protein sequence information. Besides the two feature representation methods above (i.e. the autocorrelation descriptor and the local descriptor), four additional methods were selected according to their prediction accuracy in previous studies. We then built six sequence-based individual classifiers by combining the different feature representation methods with SVMs. Finally, we adopted another SVM as the meta predictor to integrate the prediction decision values of these component predictors. The results demonstrated that our meta predictor is promising. In addition, we used the final prediction model trained on the S. cerevisiae PPI dataset to predict interactions in other species. The results reveal that the meta model is also capable of performing cross-species predictions.

4. A feature-based method that combines protrusion index with solvent accessibility was presented for accurate prediction of hot spots in protein interfaces. The biological properties responsible for hot spots are not yet fully understood, so the features previously identified as being correlated with hot spots are still insufficient. We first extracted a wide variety of features from a combination of protein sequence and structure information. We then performed feature selection to remove noisy and irrelevant features, which improved the performance of the classifier. After extensive feature selection, nine predictors, each based on an individual feature, were developed to identify hot spots using SVMs. Finally, we employed an ensemble classifier approach, which further improved the prediction accuracy for hot spots. To demonstrate its effectiveness, the proposed method was applied to two benchmark datasets. Empirical studies show that our method yields significantly better prediction accuracy than methods previously published in the literature.
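The encoding step in item 1 (amino acids mapped to physicochemical values, then summarized as lagged correlations in a fixed-length vector) can be sketched as follows. This is a minimal illustration, not the thesis implementation. It assumes an auto-covariance form of the descriptor; the abstract does not say which autocorrelation variant, which six properties, or which maximum lag were used. The function names, the two stand-in property scales, and `max_lag=5` are all hypothetical choices made for this example.

```python
# Sketch of an autocorrelation-style sequence encoding (assumed auto-covariance form).

# Kyte-Doolittle hydropathy scale, used here as a stand-in for one property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified formal side-chain charge near neutral pH (illustrative second property).
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})

PROPERTIES = [HYDROPATHY, CHARGE]


def autocovariance_descriptor(sequence, properties=PROPERTIES, max_lag=5):
    """Return a vector of length len(properties) * max_lag for one sequence.

    For each property and each lag, the feature is the mean product of the
    mean-centered property values of residues that are `lag` positions apart.
    Non-standard residue letters raise KeyError in this sketch.
    """
    n = len(sequence)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for scale in properties:
        values = [scale[aa] for aa in sequence]
        mean = sum(values) / n
        centered = [v - mean for v in values]
        for lag in range(1, max_lag + 1):
            total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
            features.append(total / (n - lag))
    return features


def encode_pair(seq_a, seq_b, **kwargs):
    """Concatenate the descriptors of two proteins into one feature vector."""
    return (autocovariance_descriptor(seq_a, **kwargs)
            + autocovariance_descriptor(seq_b, **kwargs))
```

The vector length depends only on the number of properties and the maximum lag, not on the sequence length, which is what allows proteins of different lengths to be fed to a classifier such as a rotation forest or an SVM. Concatenating the two per-protein vectors in `encode_pair` is one simple way to represent a candidate pair and is likewise an assumption.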

Keywords/Search Tags:

Protein-protein interactions, Protein sequence, Ensemble learning, Rotation forest, Support vector machine, Autocorrelation descriptor, Local descriptor, Hot spot, Protrusion index, Solvent accessibility

Related items

1	Predicting Protein Protein Interactions And Its Active Sites Based On Data Mining Algorithm
2	Prediction Research Of Protein-Protein Interaction Based On Ensemble Of Support Vector Machine And Random Forest
3	Identification of interface residues involved in protein-protein and protein-DNA interactions from sequence using machine learning approaches
4	Research On Prediction Of Protein-protein Interactions Based On Deep Neural Network And Ensemble Learning
5	Predicting Protein-Protein Interactions Based On Support Vector Machine And Complete Protein Sequence
6	Predicting Protein-protein Interactions From Protein Sequence Based On Multiple Feature Extractions
7	Prediction Of Protein Solvent Accessibility Based On All-atom Encoding
8	The Study Of Protein Amino Acid Residues' Solvent Accessibility Prediction And Gene Expression Profile Analysis
9	Research On Predicting Protein-protein Interactions Based On Machine Learning
10	Prediction Of Protein-protein Interactions Based On Multi-information Fusion