Research on Prediction of Protein-Protein Interaction Sites

Posted on: 2017-12-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z S Wei
GTID: 1310330542954957
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Proteins are the material basis of life. Protein-protein interactions are ubiquitous and play important roles in the life cycles of living cells. The interaction between two proteins is dominated by a small number of residues on each partner, which are called protein-protein interaction sites. Identifying the residues that participate in such interactions improves our understanding of molecular mechanisms. Moreover, because some diseases are closely related to interactions between specific proteins, identifying the residues involved can facilitate the development of therapeutic drugs. Because experimental identification of protein-protein interaction sites is labor-intensive and time-consuming, there is a pressing need for simple and effective computational methods to predict them. Prediction of protein-protein interaction sites has therefore become an active research area in computational biology. Owing to the complexity and diversity of protein-protein interactions, prediction of interaction sites, especially from protein sequences, remains a challenging problem.

Against this background, this dissertation studied the application of machine learning to protein-protein interaction sites prediction, focusing on prediction from protein sequences. Existing computational methods were summarized, and a key scientific problem to be solved (class imbalance) was identified. To address this problem, three classification methods were proposed and applied to protein-protein interaction sites prediction. The main work can be summarized as follows:

(1) Progress in the computational prediction of protein-protein interaction sites was reviewed, and a general procedure for such prediction was summarized. According to the feature sources used, prediction methods were categorized as sequence-based or structure-based, and each category was described in turn. For each category, general strategies for achieving better prediction performance were summarized. Finally, the problem of class imbalance in training protein-protein interaction sites predictors was presented; it must be solved before machine learning methods can be applied effectively.

(2) A cascade random forests ensemble method was proposed. To address class imbalance, sampling and classifier ensemble were combined in a cascade structure, with sampling and classifier training carried out alternately. First, a balanced training dataset was obtained by sampling and used to train a random forests model. Then, all samples were evaluated by the trained model, and some easy samples of the majority class were eliminated according to their scores. Sampling, model training and sample elimination were repeated on the remaining training samples until the remaining samples were balanced. Finally, the trained random forests models were combined in a cascade structure. Extensive experiments on benchmark datasets demonstrated that the proposed method deals effectively with class imbalance and that the trained predictor outperformed state-of-the-art methods. In addition, analysis of feature importance found solvent accessibility to be the most discriminating feature.

(3) An SVM and sample-weighted random forests ensemble method was proposed. This method combined cost-sensitive learning and classifier ensemble to relieve the class imbalance problem and to improve the performance of protein-protein interaction sites prediction. Each sample was given its own cost weight based on its evaluation by a pre-trained SVM model, while the sums of the sample weights for the two classes were kept nearly equal. These sample weights were then used to train a sample-weighted random forests model. This strategy kept the trained models from being biased by class imbalance while also improving the effect of the classifier ensemble. In addition, a novel feature representation was proposed that represents a residue effectively with a lower dimension. Results on benchmark datasets demonstrated that the proposed method relieved class imbalance effectively and significantly improved protein-protein interaction sites prediction. Feature importance experiments showed the effectiveness of the proposed feature representation and confirmed that solvent accessibility is more discriminating than the other features.

(4) A solvent accessibility sampling based ensemble method was proposed. This method divided samples into multiple subsets using a simple strategy based on solvent accessibility, and then carried out sampling within each subset. This strategy reduced the loss of solvent accessibility information caused by random sampling. Furthermore, an ensemble of classifiers trained on several subsets drawn by this sampling method was used to improve performance. Experimental results on benchmark datasets showed that the proposed sampling method performed better than random sampling and validated the performance improvement of the trained predictor.
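
The alternating sample, train and eliminate loop described in (2) can be sketched roughly as follows. This is a minimal illustration, not the dissertation's implementation: the names `train_cascade`, `predict_cascade`, `drop_fraction` and `max_stages` are hypothetical, the per-stage rejection rule at prediction time is an assumption (the abstract does not say how the cascade combines its models), and `centroid_scorer` is a simple stand-in classifier used only so the sketch runs without external libraries. A random forests trainer would be passed as `train_fn` instead.

```python
import math
import random

def centroid_scorer(neg, pos):
    """Stand-in for a random forest (assumption): logistic score along the
    line joining the two class centroids. Higher means more likely positive."""
    dim = len(pos[0])
    mu_neg = [sum(s[d] for s in neg) / len(neg) for d in range(dim)]
    mu_pos = [sum(s[d] for s in pos) / len(pos) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]

    def score(x):
        z = sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))
        return 1.0 / (1.0 + math.exp(-z))

    return score

def train_cascade(majority, minority, train_fn,
                  drop_fraction=0.3, max_stages=50, seed=0):
    """Train models on a shrinking majority pool (minority must be non-empty).

    train_fn(neg_samples, pos_samples) returns a callable score(x) in [0, 1].
    Returns a list of (score_fn, reject_threshold) stages.
    """
    rng = random.Random(seed)
    remaining = list(majority)
    stages = []
    for _ in range(max_stages):
        # Sampling: random majority subset of minority size, plus all minority samples.
        subset = rng.sample(remaining, min(len(remaining), len(minority)))
        score = train_fn(subset, minority)
        if len(remaining) <= len(minority):
            stages.append((score, 0.5))  # pool is balanced: final stage
            break
        # Evaluation: rank remaining majority samples, easiest (lowest score) first.
        ranked = sorted(remaining, key=score)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_drop = min(n_drop, len(ranked) - len(minority))
        # Elimination: drop the easiest samples and remember the cut-off score.
        stages.append((score, score(ranked[n_drop])))
        remaining = ranked[n_drop:]
    return stages

def predict_cascade(stages, x):
    """Assumed cascade rule: x is positive only if no stage rejects it."""
    for score, threshold in stages:
        if score(x) < threshold:
            return 0
    return 1

# Toy usage: 200 majority samples against 20 minority samples in one dimension.
majority = [(i / 100,) for i in range(200)]
minority = [(3 + i / 10,) for i in range(20)]
stages = train_cascade(majority, minority, centroid_scorer)
print(len(stages), predict_cascade(stages, (5.0,)), predict_cascade(stages, (0.0,)))
```

Each stage stores the score at which it cut off easy majority samples, so at prediction time an obviously negative residue is rejected early and only harder cases reach the later models. Other ways of combining the stages are equally compatible with the abstract's description.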
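
The sampling idea in (4), dividing samples by solvent accessibility and then sampling inside each subset, can be illustrated with a short sketch. The details here are assumptions, since the abstract only mentions "a simple strategy": equal-width bins over relative solvent accessibility (RSA) in [0, 1], proportional allocation per bin, and the hypothetical names `rsa_stratified_undersample`, `rsa_of` and `n_bins`.

```python
import random
from collections import defaultdict

def rsa_stratified_undersample(majority, n_target, rsa_of, n_bins=5, seed=0):
    """Undersample `majority` to about `n_target` items while keeping its
    solvent accessibility distribution.

    rsa_of(sample) must return a relative solvent accessibility in [0, 1].
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for s in majority:
        # Equal-width bins (assumption); RSA == 1.0 falls into the last bin.
        idx = min(int(rsa_of(s) * n_bins), n_bins - 1)
        bins[idx].append(s)
    picked = []
    for idx in sorted(bins):
        members = bins[idx]
        # Proportional allocation: each bin contributes its share of the target.
        k = round(n_target * len(members) / len(majority))
        picked.extend(rng.sample(members, min(k, len(members))))
    return picked

# Toy usage: each sample is (residue_id, rsa); 100 majority residues reduced to 20.
majority = ([(i, 0.1) for i in range(50)]
            + [(i, 0.5) for i in range(50, 80)]
            + [(i, 0.9) for i in range(80, 100)])
subset = rsa_stratified_undersample(majority, 20, rsa_of=lambda s: s[1])
print(len(subset))
```

Because of rounding, the result is only approximately `n_target` in general. Calling the function with different `seed` values yields different subsets with the same accessibility profile, which is one plausible way to obtain the several training subsets that the ensemble in (4) is trained on.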
Keywords/Search Tags: protein-protein interactions, interaction sites prediction, sequence-based prediction, class imbalance, classifier ensemble