Prediction Of Protein-DNA Binding Hotspot Residues Based On Features Selected By SHAP

Posted on: 2023-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
GTID: 2530306797964839
Subject: Biomathematics
Abstract/Summary:
Proteins and DNA participate in many life processes through their interactions, including gene expression and DNA assembly and repair. The binding free energy of a protein-DNA interaction is contributed mainly by a small number of binding amino acid residues (hotspot residues), so identifying protein-DNA binding hotspot residues is of great significance. It can help to elucidate the mechanism of protein-DNA interaction and to treat diseases caused by disordered protein-DNA interaction. Hotspot residues can be identified experimentally by alanine scanning mutagenesis, but this method is time-consuming and expensive. Efficient and accurate computational methods for identifying protein-DNA binding hotspot residues can therefore supplement and guide experiments. Several methods have been developed to predict hotspot residues at the protein-DNA interface, but their generalization ability is difficult to verify because the datasets they use are small.

In this study, we first collected protein-DNA binding hotspot residue data from the dbAMEPNI database, and then mined additional hotspot residue data not recorded in dbAMEPNI from recently published literature. The two parts were integrated to obtain a larger dataset of alanine mutation effects at the protein-DNA interaction interface. Based on the three-dimensional structure of each protein, we then calculated physicochemical and structural features of the residues, obtaining a high-dimensional feature matrix with 117 dimensions. To obtain features more strongly correlated with protein-DNA interaction hotspot residues, and to improve the efficiency and accuracy of the model, we used the SHAP feature selection method to select the optimal feature subset from this matrix. With the optimal feature subset as input, we built prediction models using three machine learning algorithms: support vector machine (SVM), XGBoost and random forest (RF).

The results of this study are as follows: (1) 5-fold cross-validation on the training dataset shows that the SVM-based model achieves the best accuracy. (2) Predictions on the independent test set show that the model is stable and generalizes well. (3) Comparison with existing computational methods shows that our model achieves the highest MCC value (0.2837) and AUPRC value (0.5241). These results demonstrate that our method outperforms existing predictors and promises to be an effective tool for predicting protein-DNA binding hotspot residues, which could assist tasks such as the design of related drug targets.
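The SHAP-based selection step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it computes exact Shapley values by enumerating feature subsets for a hypothetical toy scoring function (`toy_model`) on four made-up features, takes the per-feature mean of the data as the baseline, ranks features by mean absolute Shapley value, and keeps the top k. All names, data and the choice of k are assumptions for illustration. Exact enumeration grows exponentially with the number of features, so for a matrix with 117 dimensions an approximating explainer would be needed in practice.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to one `baseline` point."""
    d = len(x)

    def value(subset):
        # Features in `subset` keep their value from x; the rest use the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(d)]
        return model(z)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for combo in combinations(others, size):
                s = frozenset(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def select_top_k(model, X, k):
    """Rank features by mean |Shapley value| over the rows of X; return the top k."""
    d = len(X[0])
    baseline = [sum(row[j] for row in X) / len(X) for j in range(d)]
    importance = [0.0] * d
    for row in X:
        for j, p in enumerate(shapley_values(model, row, baseline)):
            importance[j] += abs(p) / len(X)
    ranked = sorted(range(d), key=lambda j: importance[j], reverse=True)
    return ranked[:k], importance


def toy_model(z):
    # Hypothetical scoring function standing in for a trained predictor.
    return 2.0 * z[0] + 0.1 * z[1] - 1.0 * z[2] + 0.5 * z[0] * z[3]


# Hypothetical residue feature rows (4 features each), not data from the thesis.
X = [
    [1.0, 0.2, 0.5, 0.0],
    [0.0, 0.9, 1.5, 1.0],
    [2.0, 0.4, 0.0, 1.0],
    [1.0, 0.1, 1.0, 0.0],
]

selected, importance = select_top_k(toy_model, X, k=2)
print("selected feature indices:", selected)
print("mean |Shapley value| per feature:", importance)
```

The selected column indices would then define the reduced feature matrix passed to a downstream classifier; the size of the retained subset is a tuning choice that the abstract does not specify.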
Keywords/Search Tags: Protein-DNA interaction, Hotspot residue, Feature extraction, SHAP, Machine learning