Prediction Of Protein-DNA Binding Hotspot Residues Based On Features Selected By SHAP

Posted on:2023-03-07

Degree:Master

Type:Thesis

Country:China

Candidate:H Wang

GTID:2530306797964839

Subject:Biomathematics

Abstract/Summary:

Proteins and DNA participate in many life processes through their interactions, including gene expression and DNA assembly and repair. The binding free energy of a protein-DNA interaction is contributed mainly by a small number of binding amino acid residues (hotspot residues), so identifying these residues is of great significance: it helps to elucidate the mechanism of protein-DNA interaction and can support the treatment of diseases caused by its disruption. Hotspot residues can be identified experimentally by alanine scanning mutagenesis, but this approach is time-consuming and expensive. Efficient and accurate computational methods can therefore complement and guide the experiments. Several methods have been developed to predict hotspot residues at the protein-DNA interface, but their generalization ability is difficult to verify because they were built on small datasets.

In this study, we first collected protein-DNA binding hotspot residue data from the dbAMEPNI database and then mined additional hotspot residue data, not recorded in dbAMEPNI, from recently published literature. The two parts were integrated into a larger dataset of alanine mutation effects at the protein-DNA interaction interface. Based on the three-dimensional structure of each protein, we then calculated physicochemical and structural features of the residues, obtaining a high-dimensional feature matrix with 117 dimensions. To retain the features most relevant to protein-DNA interaction hotspot residues and to improve the efficiency and accuracy of prediction, we used the SHAP feature selection method to select the optimal feature subset from this matrix. With the optimal feature subset as input, we built prediction models using three machine learning algorithms: support vector machine (SVM), XGBoost and random forest (RF).

The results of this study are as follows: (1) 5-fold cross-validation on the training dataset shows that the SVM-based model achieves the best accuracy. (2) Predictions on the independent test set show that the model has strong stability and high generalization ability. (3) Comparison with existing computational methods shows that our model achieves the highest MCC value (0.2837) and AUPRC value (0.5241). These results demonstrate that our method outperforms existing predictors and promises to be an effective tool for predicting protein-DNA binding hotspot residues, which can assist in tasks such as the design of related drug targets.
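As a rough illustration of two ideas named in the abstract, the sketch below ranks features by mean absolute Shapley value (the quantity that SHAP approximates) and computes the Matthews correlation coefficient (MCC) from binary predictions. It is not the thesis pipeline: the function names, the toy linear scorer, and the sample data are all hypothetical, and the Shapley values are computed by brute-force enumeration over feature subsets, which is only feasible for a handful of features rather than the 117 used in the study.

```python
from itertools import combinations
from math import factorial, sqrt


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to a baseline input.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2^n, so this is only for tiny illustrative cases.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def rank_features(model, samples):
    """Rank feature indices by mean absolute Shapley value over the samples."""
    n = len(samples[0])
    # Baseline: per-feature mean of the samples (an assumption of this sketch)
    baseline = [sum(s[j] for s in samples) / len(samples) for j in range(n)]
    importance = [0.0] * n
    for x in samples:
        for j, value in enumerate(shapley_values(model, x, baseline)):
            importance[j] += abs(value) / len(samples)
    order = sorted(range(n), key=lambda j: importance[j], reverse=True)
    return order, importance


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical toy scorer: feature 0 matters most, feature 2 not at all.
def toy_model(features):
    return 3.0 * features[0] + 1.0 * features[1] + 0.0 * features[2]


samples = [[0.0, 1.0, 5.0], [1.0, 0.0, 2.0], [2.0, 3.0, 9.0], [3.0, 2.0, 4.0]]
order, importance = rank_features(toy_model, samples)
selected = order[:2]  # keep the two highest-ranked features

print("feature ranking:", order)
print("selected subset:", selected)
print("MCC example:", mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

In practice, a library such as shap would estimate these values for a trained model far more efficiently, and the selected subset would then be passed to the SVM, XGBoost and RF classifiers. MCC is a useful summary here because hotspot datasets are typically imbalanced, and it accounts for all four cells of the confusion matrix.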

Keywords/Search Tags:

Protein-DNA interaction, Hotspot residue, Feature extraction, SHAP, Machine learning

Related items

1	Predicting protein-protein interactions, interaction sites and residue-residue contact matrices with machine learning techniques
2	Research Of Protein-Protein Interaction Extraction Based On Rich Feature And Multiple Kernels Learning
3	Protein-Protein Interaction: Simple Prediction Tool Development And Studies On Specific Cases
4	Feature Extraction And Deep Learning Method For Protein Inter-residue Interaction Prediction
5	Research On Machine Learning Based Protein Class And Protein-ligand Interaction Prediction
6	Prediction Of Protein-DNA Interaction Hotspots Based On Neural Network
7	Research On Machine Learning-based Protein-Protein Interaction Extraction
8	Protein-Protein Interaction Extraction Based On Combinational Learning And Active Learning
9	The Prediction Of Protein Interactions Based On Integrated Learning Model
10	A Study On Feature Extraction And Classification Algorithms For Protein Structural Class Prediction