
Prediction Of Protein Interaction Sites Based On Imbalanced Data Set

Posted on: 2021-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
GTID: 2480306743960639
Subject: Control Engineering
Abstract/Summary:
The study of protein-protein interaction has important biological significance, because interactions between proteins are a key part of the cell's biochemical reaction network. Traditional experimental methods for identifying protein interaction sites are time-consuming and labor-intensive, and their accuracy is limited. This thesis focuses on handling the class imbalance in protein data and on predicting interaction sites from such data. The main work is as follows.

In the data set acquisition part, redundant protein chains were removed from 170 transient protein interaction pairs, and 91 protein chains were finally obtained. Based on the evolutionary conservation of amino acids, four features were extracted from the HSSP database, namely residue space sequence, sequence information entropy, relative entropy, and residue sequence weight, and residue conservation scores were extracted from the protein functional region recognition server (ConSurf Server). This gives five features in total, which are fused and re-encoded to obtain the final data set.

In the prediction part, to address the sample imbalance in the above data set, this thesis proposes two sampling methods. The first processes non-interface residues based on the nearest neighbor rule and the sample overlap problem. The second samples the data based on the likelihood of each sample being misclassified. Both are combined with an XGBoost (Extreme Gradient Boosting) classifier to predict PPIs (Protein-Protein Interaction Sites). The second method achieves an accuracy of 80.7%, with sensitivity and MCC (Matthews Correlation Coefficient) reaching 81.2% and 61.4%, respectively.

Subsequently, in view of the limited definition of protein interface residues and the false negatives and false positives in the data, a data reorganization strategy for protein data is proposed. After the interface residues were processed in this way and predicted, the accuracy and MCC of the prediction results reached 94.73% and 84.47%, respectively. Compared with traditional experimental and computational methods, the classification performance is improved to a certain extent.
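The abstract reports accuracy, sensitivity, and MCC. As a minimal sketch of how these three metrics are computed from a binary confusion matrix (interface residue = positive class), the following may help. The function name `binary_metrics` and all counts below are hypothetical illustrations, not values or code from the thesis.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, MCC) for a binary confusion matrix.

    Any metric whose denominator is zero is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity (recall): fraction of true interface residues recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews Correlation Coefficient: uses all four cells of the matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, mcc


# Hypothetical counts for a classifier on a roughly balanced sample.
acc, sens, mcc = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} MCC={mcc:.3f}")

# Hypothetical imbalanced case: 10 interface residues among 100, and a
# classifier that labels every residue as non-interface.
acc_all_neg, sens_all_neg, mcc_all_neg = binary_metrics(tp=0, fp=0, tn=90, fn=10)
print(f"accuracy={acc_all_neg:.3f} sensitivity={sens_all_neg:.3f} MCC={mcc_all_neg:.3f}")
```

The second call illustrates why accuracy alone can mislead on imbalanced data: a classifier that never predicts an interface residue still scores high accuracy, while its sensitivity and MCC are both zero.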
Keywords/Search Tags: Protein interaction sites, sample imbalance, conservation features, XGBoost, limited definition, data reorganization