Classification Models For Predicting Sumoylation Sites And RNA Binding Proteins

Posted on: 2022-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: H C Chen
Full Text: PDF
GTID: 2504306485950159
Subject: Computer application technology
Abstract/Summary:
Proteins play an important role in life activities. Understanding the structure and function of proteins helps to explore the operating mechanisms of life and promotes the development of therapeutic drugs. In particular, the binding of proteins to nucleic acids has an important regulatory effect on cell transcription, and post-translational modifications of proteins are widespread in the cellular translation process. Therefore, predicting nucleic acid binding proteins and protein post-translational modification sites is of great significance for understanding protein functions. Traditional biological experiments are time-consuming and costly, and cannot keep pace with the ever-increasing volume of large-scale protein data. Computational methods, by contrast, cost less and are simple and efficient, and the development of machine learning has made machine-learning-based computational models a promising option. This thesis therefore applies machine learning classification methods to the prediction of sumoylation sites and RNA-binding proteins, and proposes an effective prediction model for each task.

For the prediction of protein sumoylation sites, this thesis proposes SUMO-LGBM, a prediction model based on sequence features. First, the model describes amino acid residues using statistical features of the physicochemical properties of amino acids together with bigram pattern features of the amino acid sequence. Second, a Light Gradient Boosting Machine (LightGBM) classification model is trained to locate sumoylation sites among the amino acid residues of the protein sequence. The thesis compares the discriminative power of different features and the prediction performance of different classification models. Under ten-fold cross-validation on the benchmark dataset, the proposed model significantly outperforms existing methods, with a Matthews Correlation Coefficient (MCC) of 91.64% and an AUC of 99.57%. The experimental results demonstrate the effectiveness of the proposed method, which can serve as an auxiliary method for verifying protein sumoylation sites in biological experiments.

For the prediction of RNA-binding proteins, this thesis proposes a novel prediction model, CnnEtRBP, based on statistical features of the tripeptide frequencies of protein sequences. First, a convolutional neural network is applied for feature extraction. Second, an extremely randomized trees classifier is trained. In addition, the model uses the Synthetic Minority Oversampling Technique (SMOTE) to upsample the minority class and alleviate data imbalance in the training set. On independent test sets from three different species, the model achieves state-of-the-art AUC values, more than 2% higher than the second-ranked method. The experimental results show that the proposed method is effective and can provide candidate targets for the experimental identification of RNA-binding proteins.
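
The abstract does not give implementation details, so the following is only a minimal illustrative sketch, not the thesis's code. It assumes that the tripeptide frequency features (and the binary syntax, i.e. bigram, pattern features) can be computed as relative frequencies of length-3 (or length-2) windows over the 20 standard amino acids, and it computes MCC with the standard confusion-matrix formula. The function names `kmer_frequencies` and `mcc` and the example sequence are hypothetical.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(sequence, k):
    """Relative frequency of every length-k peptide, in a fixed order.

    k=2 gives a 400-dimensional bigram vector, k=3 an 8000-dimensional
    tripeptide vector. Assumption: windows containing non-standard
    residues are skipped; the thesis may treat them differently.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    seq = "MKVLAAGIVK"  # made-up sequence, for illustration only
    tripeptide = kmer_frequencies(seq, 3)
    bigram = kmer_frequencies(seq, 2)
    print(len(tripeptide), len(bigram))
    print(mcc(tp=50, tn=50, fp=0, fn=0))
```

Feature vectors of this kind could then be passed to a classifier such as LightGBM or extremely randomized trees. MCC ranges from -1 to 1, so the reported 91.64% corresponds to a value of 0.9164 on that scale.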
Keywords/Search Tags:Protein sequence, Protein sumoylation, RNA binding proteins, SUMO-LGBM, CnnEtRBP