Sequence-Based Prediction Of Protein-GDP/GTP Binding Sites

Posted on: 2023-08-24    Degree: Master    Type: Thesis
Country: China    Candidate: J H Wang    Full Text: PDF
GTID: 2530306818487504    Subject: Computer technology
Abstract/Summary:
GDP (guanosine diphosphate) and GTP (guanosine triphosphate) are nucleotides that play an irreplaceable role in most biochemical reactions in organisms, including DNA replication and transcription, transmembrane transport, muscle contraction, and various metabolic processes. In most cellular activities, proteins must bind nucleotides in order to function. Identifying protein-nucleotide binding sites helps to explore the mechanisms of intermolecular interaction, to explain the pathogenesis of diseases, and to support drug discovery and design. Traditional research usually identifies protein-nucleotide binding sites through biological experiments, which are often costly, time-consuming, and difficult to apply widely. Computational methods for studying protein binding sites are therefore particularly important. At the same time, predicting nucleotide binding sites in protein sequences is an imbalanced binary classification problem, because the GDP and GTP non-binding residues in protein sequences far outnumber the binding residues. A sampling method is therefore needed to address this imbalance. The main work of this thesis on the prediction of protein-GDP/GTP binding sites is as follows:

(1) Feature extraction of proteins. The position-specific iterative search algorithm is applied to each protein sequence to obtain its position-specific scoring matrix (PSSM), and the logistic function is used to normalize the PSSM. A feature vector for each amino acid residue in the protein sequence is then extracted using a variable sliding window based on mirrored residues.

(2) Two under-sampling methods are studied: CNMW (Clustering Near Miss-2 Weighted) under-sampling and neighborhood cleaning under-sampling. In CNMW under-sampling, the majority class samples are first grouped into K clusters, and each cluster is assigned a weight according to the Near Miss-2 distance; this weight serves as the first weight of the majority class samples. Then, considering the sample set globally, a nearest-neighbor approach is used to assign a weight to every sample in the set; this is the second weight. Each majority class sample now has two weights, which are multiplied to give a new weight. The majority class samples are sorted by the new weight in descending order, and as many of them as there are minority class samples are selected in that order and combined with the minority class samples to form a new data set. In neighborhood cleaning under-sampling, the three nearest neighbors of each sample in the data set are selected to form a set M. For a non-binding site sample p, if at least two samples in M are binding site samples, p is removed; for a binding site sample q, if more than two samples in M are non-binding site samples, the non-binding site samples in M are removed.

(3) Two prediction models for protein-GDP and protein-GTP binding sites are proposed. Neighborhood cleaning under-sampling and SVM are combined into the NCL_S prediction model, and CNMW under-sampling and SVM are combined into the CNMW_S prediction model. On the standard dataset, five-fold cross-validation experiments and independent testing experiments were performed for protein-GDP binding site prediction and protein-GTP binding site prediction to test the performance of the two prediction models. The experimental results show that both prediction models improve the prediction performance to a certain extent.
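The feature extraction step in (1) can be illustrated with a minimal sketch. It assumes the PSSM is already available as an L x 20 array, that "logistic function" means 1 / (1 + e^-x) applied element-wise, and that "mirrored residues" means window positions beyond either terminus are filled by reflecting about the terminal residue. The function names and the `half_width` parameter are hypothetical and not taken from the thesis.

```python
import numpy as np

def logistic_normalize(pssm):
    """Map raw PSSM scores into (0, 1) with the logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(pssm, dtype=float)))

def mirrored_window_features(pssm, half_width):
    """Return one feature vector per residue from a window of 2*half_width + 1 rows.

    Assumption: positions outside the sequence are filled by mirroring about
    the terminal residue (the terminal row itself is not repeated).
    """
    norm = logistic_normalize(pssm)
    length = norm.shape[0]
    if half_width >= length:
        raise ValueError("half_width must be smaller than the sequence length")
    # mode="reflect" mirrors rows about the first and last residue.
    padded = np.pad(norm, ((half_width, half_width), (0, 0)), mode="reflect")
    window = 2 * half_width + 1
    # Flatten each window of PSSM rows into a single vector.
    return np.stack([padded[i:i + window].ravel() for i in range(length)])
```

Calling `mirrored_window_features(pssm, half_width)` with several values of `half_width` is one way to explore a variable window size; how the thesis chooses the window size is not stated in the abstract.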
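The neighborhood cleaning rule in (2) can be sketched as follows. This is an illustration under stated assumptions rather than the thesis implementation: Euclidean distance is assumed, labels are assumed to be 1 for binding (minority) and 0 for non-binding (majority) samples, and the function name `neighborhood_cleaning` is hypothetical. With three neighbors, "more than two" non-binding neighbors means all three.

```python
import numpy as np

def neighborhood_cleaning(X, y, k=3):
    """Return the indices of the samples kept after cleaning.

    y uses 1 for binding-site samples and 0 for non-binding-site samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances (assumed metric); fine for small data sets.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is never its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # the set M of each sample

    remove = np.zeros(len(y), dtype=bool)
    for i, M in enumerate(neighbors):
        n_binding = int(np.sum(y[M] == 1))
        if y[i] == 0 and n_binding >= 2:
            # Non-binding sample p with at least two binding neighbors: drop p.
            remove[i] = True
        elif y[i] == 1 and (k - n_binding) > 2:
            # Binding sample q with more than two non-binding neighbors:
            # drop those non-binding neighbors, keep q.
            remove[M[y[M] == 0]] = True
    return np.flatnonzero(~remove)
```

The full distance matrix keeps the sketch short but grows quadratically with the number of residues, so a tree-based neighbor search would be a more practical choice on real datasets.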
Keywords/Search Tags: protein-GDP/GTP binding site prediction, position-specific scoring matrix, under-sampling, sliding window, SVM