Sequence-Based Prediction Of Protein-GDP/GTP Binding Sites

Posted on:2023-08-24

Degree:Master

Type:Thesis

Country:China

Candidate:J H Wang

Full Text:PDF

GTID:2530306818487504

Subject:Computer technology

Abstract/Summary:

GDP (guanosine diphosphate) and GTP (guanosine triphosphate) are nucleotides that play an irreplaceable role in most biochemical processes in organisms, including DNA replication and transcription, transmembrane transport, muscle contraction, and various metabolic processes. In most cellular activities, proteins and nucleotides must bind to each other to carry out their functions. Identifying protein-nucleotide binding sites therefore helps to explore the mechanisms of intermolecular interaction, to explain the pathogenesis of diseases, and to support drug discovery and design. Traditionally, protein-nucleotide binding sites have been identified through biological experiments, which are often costly, time-consuming, and difficult to apply widely, so computational methods for studying protein binding sites are particularly important. Predicting nucleotide binding sites from protein sequences is also an imbalanced binary classification problem, because GDP and GTP non-binding residues far outnumber binding residues, and a sampling method is needed to address this imbalance. The main work of this thesis on the prediction of protein-GDP/GTP binding sites is as follows:

(1) Feature extraction of proteins. The position-specific iterative search algorithm (PSI-BLAST) is run on each protein sequence to obtain its position-specific scoring matrix (PSSM), and the PSSM is normalized with the logistic function. A feature vector for each amino acid residue in a protein sequence is then extracted using a variable sliding window based on mirrored residues.

(2) Two under-sampling methods are studied: CNMW (Clustering Near Miss-2 Weighted) under-sampling and neighborhood cleaning under-sampling. In CNMW under-sampling, the majority class samples are clustered into K clusters, and each cluster is assigned a weight according to the Near Miss-2 distance; this is the first weight of a majority class sample. Then, considering the sample set as a whole, a nearest-neighbor approach assigns a weight to each sample in the set; this is the second weight of the sample. Each majority class sample now has two weights, which are multiplied to obtain a new weight. The majority class samples are sorted by the new weight in descending order, as many of them as there are minority class samples are selected in that order, and the selected samples are combined with the minority class samples to form a new data set. In neighborhood cleaning under-sampling, the three nearest neighbors of each sample in the data set are selected to form a set M. For a non-binding site sample p, if at least two samples in M are binding site samples, p is removed; for a binding site sample q, if there are more than two non-binding site samples in M, the non-binding site samples in M are removed.

(3) Two prediction models for protein-GDP and protein-GTP binding sites are proposed. Neighborhood cleaning under-sampling is combined with an SVM to form the NCL_S prediction model, and CNMW under-sampling is combined with an SVM to form the CNMW_S prediction model. On the standard dataset, five-fold cross-validation experiments and independent tests were performed for protein-GDP and protein-GTP binding site prediction to evaluate the two models. The experimental results show that both prediction models improve prediction performance to a certain extent.
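As a rough illustration of the feature extraction in step (1), the following Python sketch normalizes a PSSM with the logistic function and builds a windowed feature vector for one residue. The function names, the toy matrix, and the decision to fill window positions beyond a sequence end by reflecting about the terminal residue are assumptions made for illustration; the thesis may define mirrored residues and the window size differently.

```python
import math

def logistic(x):
    """Map a raw PSSM score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def normalize_pssm(pssm):
    """Apply the logistic function to every entry of a PSSM (list of rows)."""
    return [[logistic(v) for v in row] for row in pssm]

def mirror_index(i, length):
    """Reflect an out-of-range position back into [0, length).

    Assumption: positions are mirrored about the terminal residue,
    so -1 maps to 1 and length maps to length - 2.
    """
    if length == 1:
        return 0
    period = 2 * (length - 1)
    i %= period  # Python's % always returns a non-negative value here
    return i if i < length else period - i

def window_features(norm_pssm, center, window):
    """Concatenate the normalized PSSM rows inside an odd-sized window.

    `window` is the (variable) window size; positions that fall outside
    the sequence are taken from mirrored residues.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    n = len(norm_pssm)
    features = []
    for pos in range(center - half, center + half + 1):
        features.extend(norm_pssm[mirror_index(pos, n)])
    return features

# Toy PSSM with 3 residues and 2 columns (a real PSSM has 20 columns).
toy_pssm = [[0.0, 2.0], [1.0, -1.0], [3.0, 0.0]]
toy_norm = normalize_pssm(toy_pssm)
first_residue_vector = window_features(toy_norm, center=0, window=3)
```

With a window of size w and a 20-column PSSM, each residue would be represented by a vector of 20 × w values under this scheme.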
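The neighborhood cleaning rule in step (2) can be sketched in a few lines of Python. This is a minimal reading of the rule as worded in the abstract, not the author's implementation: the function names, the Euclidean distance, the label encoding (1 for binding, 0 for non-binding), and the one-dimensional toy data are all assumptions, and removals are collected in a single pass before being applied.

```python
import math

def nearest_neighbors(X, i, k=3):
    """Return the indices of the k samples closest to sample i."""
    dists = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in dists[:k]]

def neighborhood_cleaning(X, y, k=3):
    """Return the indices of the samples kept after cleaning.

    y[i] == 1 marks a binding site sample, y[i] == 0 a non-binding one.
    """
    remove = set()
    for i in range(len(X)):
        M = nearest_neighbors(X, i, k)
        n_binding = sum(y[j] for j in M)
        n_non_binding = len(M) - n_binding
        if y[i] == 0:
            # Non-binding sample p: drop p if at least two neighbors bind.
            if n_binding >= 2:
                remove.add(i)
        else:
            # Binding sample q: if more than two neighbors are non-binding,
            # drop those non-binding neighbors.
            if n_non_binding > 2:
                remove.update(j for j in M if y[j] == 0)
    return [i for i in range(len(X)) if i not in remove]

# Hypothetical one-dimensional feature vectors for illustration only.
X = [(0.0,), (10.0,), (20.0,), (12.0,), (500.0,),
     (510.0,), (520.0,), (530.0,), (514.0,)]
y = [1, 1, 1, 0, 0, 0, 0, 0, 1]
kept = neighborhood_cleaning(X, y)
```

In this toy data the non-binding sample surrounded by binding samples is dropped, as are the three non-binding neighbors of the isolated binding sample, so the cleaned set is less dominated by the majority class near the class boundary.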

Keywords/Search Tags:

protein-GDP/GTP binding sites prediction, position specific scoring matrix, under-sampling, sliding window, SVM

Related items

1	The Dynamic Method Of Transcription Factor Binding Sites Recognition Based On Genetic Algorithm And Position Specific Scoring Matrix
2	Prediction Of The Correlation Of Triplet Transcription Factor Binding Sites Based On PWMSA
3	Predicting MHC-Ⅱ Binding Affinity Using DE And PSSM
4	Prediction Method Research Of Protein Based On Protein Evolutionary Information
5	Prediction Of Bacterial Type Ⅳ Secreted Effectors And Phage Virion Proteins By Integrating Sequence And Evolutionary Information
6	Study Of Metal-ion Binding Sites For Disease-associated Proteins
7	Prediction Of Protein Structure Based On Multi-Information Fusion
8	The Machine Learning Model Of Protein Structural Prediction Based On Protein Sequence
9	Research On Several Sequence Information Extraction Methods And Subcellular Location Prediction Of Proteins
10	Combining Feature- And Template-based Strategies To Predict Nucleic Acid-binding Residues In Proteins