Study On Protein-RNA Specific Binding Site And Structure Prediction By Deep Learning And Ensemble Learning

Posted on: 2022-08-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 1480306764993299
Subject: Automation Technology
Abstract/Summary:
Protein-RNA recognition and interactions are closely related to a variety of life activities in organisms, such as gene expression and regulation, protein synthesis and virus replication. Currently, only about 3,000 protein-RNA complex structures are available in the Protein Data Bank (PDB), far fewer than the number (~860,000) that researchers predict to exist. The main difficulty in determining protein-RNA complex structures experimentally is that RNAs are quite unstable and flexible, so crystal structures of the complexes are often hard to obtain. A reliable theoretical method for predicting the structures of protein-RNA complexes is therefore urgently needed. The prediction of RNA-binding sites on proteins and the prediction of protein-RNA complex structures have become a hot spot in the field of bioinformatics. This dissertation focuses on the prediction of protein-RNA binding sites and complex structures, and its main contents are summarized as follows:

1) Based on our previous finding that conserved protein residues tend to cluster together in space, we propose a new encoding scheme, SNB-PSSM, that incorporates the evolutionary information of the spatial neighbors of a target residue, and apply it to the prediction of RNA-binding sites on proteins. The test on the RB44 dataset demonstrates that the SNB-PSSM method achieves an evident improvement over the standard and smoothed PSSMs, with the Matthews correlation coefficient (MCC) increasing by 39% and 19%, and the accuracy (ACC) increasing by 4% and 13%, respectively. Using a sliding window encoding scheme, the SNB-PSSM based structure window method performs better than the standard PSSM and smoothed PSSM based sequence window methods, with MCC increasing by 13% and 6%, and ACC increasing by 5% and 13%, respectively. Additionally, the tests on 7 large datasets indicate that our method is, to some extent, superior to many classic methods (FastRNABindR, RNABindR v2, BindN+, PPRInt, KYG and PRIP) that use PSSM profiles, physicochemical properties or structure-based features. Furthermore, the tests on bound/unbound proteins and on experimental/modelled structures indicate that our method is not sensitive to changes in protein secondary structure and is reasonably robust against structural variation, which allows it to be applied to binding site prediction for modelled structures. This work demonstrates that considering the evolutionary information of spatially neighboring residues can significantly improve RNA-binding site prediction, and it suggests that binding sites evolve in a spatially cooperative manner to some extent. We believe that the proposed method, if combined with other sequence-, structure- and dynamics-derived features, can be better used for the prediction of protein-protein/nucleic acid binding sites, as well as protein catalytic sites and hot spots.

2) Adopting the above evolutionary information encoding scheme SNB-PSSM, and also considering sequence, structure and dynamics features, we propose a sequence-based ab initio approach, aPRBind (ab-initio Protein-RNA Binding site prediction), to predict RNA-binding residues in proteins, using the modelling method I-TASSER and a deep convolutional neural network model. In structure prediction, all homologous templates with sequence identity > 30% to the target are excluded from the template library in order to reflect the general situation. The results show that even with this exclusion, the majority of the I-TASSER models (82.6%) have a correct fold, with a TM-score no less than 0.5. The analysis of feature contributions indicates that the sequence features are the most important to the prediction, that the dynamics features are also crucial, and that the sequence-based and structure-based features are complementary in binding site prediction. The performance of aPRBind on the independent test set illustrates that our method gives better predictions for correctly modelled folds. In addition, aPRBind is robust against structural variations as long as the residue positions are approximately correct, mainly because the structural features used in aPRBind are at a coarse-grained level and are not very sensitive to the refined 3-dimensional structures. Our method outperforms some classic sequence-based prediction servers: FastRNABindR, RNABindR v2, BindN+ and PPRInt. This work helps strengthen our understanding of protein-RNA recognition and interactions, and can be used for protein-RNA docking prediction and for exploring binding hot spots in experiments.

3) We propose PRDCE, a method integrating multiple machine learning methods, to evaluate protein-RNA docking decoys and select near-native complex structures. It uses physics-based energy terms, an amino acid-nucleotide pairwise potential, the relative solvent accessible surface area of the interface and topological characteristics. These features are learned by an integrated machine learning model, XGBR (XGBoost Regression), which gives the probability that a docking decoy is a near-native one. Feature analyses indicate that long-range electrostatic attraction and repulsion, van der Waals attraction and repulsion, interface propensity, solvent accessibility of the interface, short-range electrostatic attraction and clustering centrality play important roles in the discrimination of near-native structures. Long-range electrostatic attraction is the most important of these, which is consistent with the dominant role of electrostatic interactions in protein-RNA interactions. Additionally, the test results show that the Support Vector Machine (SVM) model performs better than the Random Forest (RF) and Naive Bayes (NB) models, and that the integrated XGBR model performs better than any single model. It is also found that structure clustering performed before scoring can improve the discrimination results. Compared with four existing, well-performing scoring functions (RpveScore, ITScore-PR, DARS-RNP and QUASI-RNP), PRDCE shows the best performance on both datasets, which include 35 cases and 18 cases respectively, and its success rate is higher than those obtained by the four scoring functions. Thus, the results indicate that the consideration of multiple features and the use of the integrated XGBR model play an important role in the performance improvements, and that machine learning methods have some advantages over traditional scoring functions in evaluating docking decoys. This work is helpful to the development of protein-RNA docking methods, and can strengthen the understanding of the protein-RNA interaction mechanism.

In summary, this work mainly focuses on the prediction of RNA-binding sites on proteins and the prediction of protein-RNA complex structures. The major innovations include: 1) A novel evolutionary information encoding scheme, SNB-PSSM, is proposed, in which the evolutionary information of the spatial neighbors of target residues is considered. Its application to protein-RNA binding site prediction shows a good effect. 2) A new, effective, deep learning based ab initio prediction method, aPRBind, is developed for RNA-binding sites on proteins. It uses a convolutional neural network (CNN) and the structure modelling method I-TASSER, together with multiple features including SNB-PSSM and structure- and dynamics-based ones. aPRBind outperforms some classic sequence-based prediction servers. 3) An integrated method, PRDCE, combining multiple machine learning methods, is constructed for evaluating and screening protein-RNA docking decoys, in which features including different energetic terms, topological characteristics and solvent accessibility are comprehensively considered. Compared with traditional statistical potential based models and multiple linear regression models, PRDCE has some advantages. These studies strengthen the understanding of protein-RNA specific interaction mechanisms, can provide valuable tools and information for experimental researchers working on related problems, and give useful guidance for structure-based nucleic acid drug design.
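
The abstract does not spell out how SNB-PSSM combines the evolutionary information of spatial neighbors. As a minimal sketch only, assuming that each residue's PSSM row is averaged with the rows of its k nearest residues in 3D space (the function name `snb_pssm`, the parameter `k`, the plain averaging and the toy data are all illustrative assumptions, not the dissertation's actual formulation), the idea might look like this:

```python
import math


def snb_pssm(pssm, coords, k=2):
    """Hypothetical spatial-neighbor PSSM encoding (an assumption, not the
    author's definition): average each residue's PSSM row with the rows of
    its k nearest residues in space.

    pssm   -- list of per-residue score rows (20 columns in a real PSSM)
    coords -- list of (x, y, z) positions, one per residue (e.g. C-alpha)
    """
    n = len(pssm)
    encoded = []
    for i in range(n):
        # Rank all other residues by Euclidean distance to residue i.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )[:k]
        group = [i] + neighbors
        width = len(pssm[i])
        # Column-wise mean over the target residue and its spatial neighbors.
        encoded.append(
            [sum(pssm[j][c] for j in group) / len(group) for c in range(width)]
        )
    return encoded


# Toy input: 4 residues with 2-column rows for brevity. Residue 3 is far
# from residue 0 in sequence but close to it in space.
toy_pssm = [[3, 0], [0, 3], [9, 9], [6, 0]]
toy_coords = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 1, 0)]
toy_encoded = snb_pssm(toy_pssm, toy_coords, k=2)
```

In this toy case, residue 0 is averaged with residues 1 and 3, its spatial neighbors, rather than with residues 1 and 2, which a sequence-window smoothing of the same width would pick.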
Keywords/Search Tags: Protein-RNA interactions, Binding site prediction, Deep learning, Molecular docking, Ensemble learning