Computational Prediction of Protein-Protein Interactions and Hot Spot Residues in Protein Interfaces

Posted on: 2011-10-30 | Degree: Doctor | Type: Dissertation | Country: China | Candidate: J F Xia | GTID: 1100360305966752 | Subject: Bioinformatics

Abstract/Summary:

With the completion of the sequencing of the human genome and the genomes of other species, biological research has gradually moved from the genomics era to the post-genomics era. As one of the most important fields of the post-genomics era, proteomics, which focuses on the study of all possible protein-protein interactions (PPIs) in a cell, has become a hot topic and frontier of the life sciences. Studying PPIs helps us understand the essential mechanisms of life processes.

So far, a number of computational methods have been explored for the large-scale prediction of PPIs. Among them, sequence-based prediction methods have attracted much attention, because their accuracy and reliability do not depend on prior information about the protein pairs. Given the limited availability of three-dimensional protein structures and the rapid increase in the number of protein sequences, approaches that use amino acid sequence information alone to guide the discovery of PPIs are of particular interest. The current study therefore applies machine learning techniques such as the support vector machine (SVM) and multiple classifier systems to predict PPIs from sequences. In addition, we introduce an ensemble learning method with SVMs to predict hot spot residues, which are observed to be crucial for preserving protein function and maintaining the stability of protein association. The main contributions of this thesis are as follows:

1. A sequence-based approach was proposed to predict PPIs by combining a new feature representation, the autocorrelation descriptor, with rotation forest.
The autocorrelation descriptor accounts for the interactions between amino acid residues a certain distance apart in the sequence, so it takes the effect of each amino acid's local environment into account and makes it possible to discover patterns that run through entire sequences. The amino acid sequences were first translated into numerical values representing six physicochemical properties, and these numerical sequences were then converted into a series of fixed-length vectors by the autocorrelation descriptor. Finally, the rotation forest was constructed using these vectors as input. Rotation forest is a recently proposed robust ensemble method that can simultaneously enhance the accuracy of the individual classifiers and the diversity within the ensemble. Experimental results on Saccharomyces cerevisiae and Helicobacter pylori datasets show that the proposed approach outperforms previously published methods, which demonstrates its effectiveness and efficiency.

2. A method based on a novel local protein sequence descriptor and SVM was presented to infer PPIs. One particular feature of protein interactions is that they usually occur in discontinuous regions of the protein sequence, where distant residues are brought into spatial proximity by protein folding. In the current study, a novel local protein sequence descriptor was used to capture information about interactions between distant amino acids in the sequence. Each protein sequence was characterized by ten local descriptors of varying length and composition, so the method is capable of capturing multiple overlapping continuous and discontinuous binding patterns within a protein sequence. The experimental results show that our SVM-based predictive model with this encoding scheme is an important complementary method for PPI prediction.

3. 
A public meta predictor was constructed to infer PPIs using only protein sequence information. Besides the two foregoing feature representation methods (i.e., the autocorrelation descriptor and the local descriptor), four additional methods were selected according to their prediction accuracy in previous studies. We then built six sequence-based individual classifiers by combining the different feature representation methods with SVMs. Finally, we adopted another SVM as the meta predictor to integrate the prediction decision values of these component predictors. The results demonstrated that our meta predictor is promising. In addition, we used the final prediction model trained on the S. cerevisiae PPI dataset to predict interactions in other species. The results reveal that the meta model is also capable of cross-species prediction.

4. A feature-based method that combines protrusion index with solvent accessibility was presented for accurate prediction of hot spots in protein interfaces. The biological properties responsible for hot spots are not yet fully understood, and consequently the features previously identified as being correlated with hot spots are still insufficient. We first extracted a wide variety of features from a combination of protein sequence and structure information. We then performed feature selection to remove noisy and irrelevant features, which improved the performance of the classifier. After extensive feature selection, nine individual-feature-based predictors were developed to identify hot spots using SVMs. Finally, we employed an ensemble classifier approach, which further improved the prediction accuracy for hot spots. To demonstrate its effectiveness, the proposed method was applied to two benchmark datasets. Empirical studies show that our method yields significantly better prediction accuracy than previously published methods.
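
As a concrete illustration of the autocorrelation encoding described in item 1, the following minimal Python sketch turns a variable-length sequence into a fixed-length vector. It is not the thesis's implementation: the abstract does not give the exact formula or name the six physicochemical properties, so the sketch assumes a simple auto-covariance form and uses the Kyte-Doolittle hydropathy scale as a stand-in for one property. Concatenating the two proteins' vectors to represent a pair is likewise an assumption. The names `KD_HYDROPATHY`, `autocovariance_descriptor`, and `encode_pair`, and the `max_lag` parameter, are hypothetical.

```python
from statistics import mean

# Kyte-Doolittle hydropathy values, used only as a stand-in property scale
# (assumption: the abstract does not list the six properties actually used).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocovariance_descriptor(seq, scale, max_lag):
    """Return max_lag auto-covariance values for one property scale.

    For each lag d in 1..max_lag, average the product of mean-centered
    property values of residues that are d positions apart.
    """
    values = [scale[aa] for aa in seq]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mu = mean(values)
    centered = [v - mu for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def encode_pair(seq_a, seq_b, scales, max_lag):
    """Concatenate the descriptors of both proteins into one feature vector
    (assumed pair representation, for illustration only)."""
    vector = []
    for seq in (seq_a, seq_b):
        for scale in scales:
            vector.extend(autocovariance_descriptor(seq, scale, max_lag))
    return vector


if __name__ == "__main__":
    features = encode_pair("MKTAYIAKQRQISFVK", "GAVLIMFWPS", [KD_HYDROPATHY], 3)
    print(len(features), features)
```

With `max_lag` lags and `k` property scales, each protein pair maps to a vector of length `2 * k * max_lag` regardless of sequence length, which is what lets sequences of different lengths feed a single classifier.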

Keywords/Search Tags: Protein-protein interactions, Protein sequence, Ensemble learning, Rotation forest, Support vector machine, Autocorrelation descriptor, Local descriptor, Hot spot, Protrusion index, Solvent accessibility