Research On Prediction Of S-nitrosylation Proteins And Sites By Fusing Multiple Features

Posted on: 2022-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: Q K Wang
GTID: 2480306611457954
Subject: Automation Technology
Abstract/Summary:
Protein post-translational modifications (PTMs) are processes that increase the functional diversity of the proteome through the covalent addition of functional groups or proteins, the proteolytic cleavage of subunits, or the modulation of overall protein degradation. PTMs play important roles in many physiological processes. To date, more than 400 PTMs have been identified, including phosphorylation, S-nitrosylation, acetylation, and ubiquitination. Traditional experimental methods for identifying modification sites are time-consuming and costly, and few studies have applied computational methods, especially to the identification of modified proteins. It is therefore important to develop a computational method that first distinguishes modified proteins from non-modified proteins and then predicts the specific modification sites, both for understanding the basic physiological processes of organisms and for studying related drugs. Extracting protein sequence information is a particularly important step in such methods. In current research, however, oversimplified feature extraction retains sequence information poorly, which degrades the final prediction. Feature extraction methods that preserve sequence information more comprehensively are therefore needed.

S-nitrosylation (SNO), one of these PTMs, is the covalent modification of cysteine residues by nitric oxide (NO) and its derivatives, and it plays an important role in various biological processes. This study proposes two models, one to identify SNO proteins and one to identify their modification sites. For the identification of SNO proteins, three feature extraction methods were applied to the protein sequences: a K-nearest neighbor scoring matrix based on GO annotation information, pseudo-amino acid composition (PseAAC), and a bag-of-words model based on amino acid physicochemical properties. Together, these methods draw as fully as possible on the sequence information, structural features, annotation information, and amino acid physicochemical properties of the protein itself. To preserve the sequence information more comprehensively, the resulting features are then fused. To reduce the negative impact of dataset imbalance, the dataset is balanced by combining synthetic minority oversampling with random deletion, and the balanced feature vectors are fed into a random forest. Under five-fold cross-validation, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) were 81.84%, 0.5178, and 0.8635, respectively. Because multi-feature fusion preserves the sequence information well, the model outperforms other prediction models for this kind of problem. This result lays the foundation for faster and more accurate identification of specific SNO sites.

To predict S-nitrosylation sites, a second model was built on two feature extraction methods: tripeptide composition (TPC) and the composition of k-spaced amino acid pairs (CKSAAP). Both methods retain as much of the sequence fragment information as possible, and the features they produce are also fused to preserve that information more completely. To eliminate the redundant information introduced by feature fusion and to improve the efficiency of the model, an elastic net is used for feature selection. A comparison of the five-fold cross-validation results with existing predictors shows that the proposed model improves prediction performance significantly.

This thesis is the first to pose the problem of identifying S-nitrosylated proteins by fusing multiple kinds of sequence information, and it explores the prediction of S-nitrosylation sites on the basis of the identified S-nitrosylated proteins. The comparisons confirm that multi-feature fusion preserves protein sequence information more comprehensively and thereby achieves better prediction performance. Multi-feature fusion is therefore considered necessary for the prediction of modified proteins and their corresponding sites.
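
The two site-level encodings named in the abstract, TPC and CKSAAP, and their fusion can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the maximum gap `k_max=2`, normalization by the number of valid tripeptides or pairs, fusion by simple concatenation, and the example fragment are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def tpc(fragment):
    """Tripeptide composition: relative frequency of each of the 20^3 tripeptides."""
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(fragment) - 2):
        tri = fragment[i:i + 3]
        if tri in counts:  # skip windows containing non-standard symbols
            counts[tri] += 1
            total += 1
    return [counts[key] / total if total else 0.0 for key in keys]


def cksaap(fragment, k_max=2):
    """Composition of k-spaced amino acid pairs for gaps k = 0..k_max.

    For each gap k, counts the pairs (fragment[i], fragment[i + k + 1]) and
    normalizes by the number of valid pairs (an assumed normalization).
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(fragment) - k - 1):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


def fused_features(fragment, k_max=2):
    """Fuse the two encodings by concatenation (assumed fusion scheme)."""
    return tpc(fragment) + cksaap(fragment, k_max)


# Hypothetical 21-residue fragment centred on a cysteine (C); not real data.
fragment = "AKLMVDEGTRCPLSNQWYHFI"
vector = fused_features(fragment)
print(len(vector))  # 8000 TPC features + 3 * 400 CKSAAP features
```

A fused vector of this size is dominated by zeros for short fragments, which is consistent with the abstract applying a feature selection step after fusion.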
Keywords/Search Tags: S-nitrosylation, random forest, post-translational modification, multiple features, identification