| Proteins are the principal executors of cellular functions, and their normal function determines whether life activities can proceed in an orderly and efficient manner. Post-translational modifications (PTMs) give proteins more complex structures and more complete functions, and allow finer regulation. More than 400 PTMs have been detected to date, of which Pupylation and Ubiquitylation are important PTMs that play a key role in the cellular function of microorganisms. Accurate prediction of Pupylation and plant Ubiquitylation proteins and their modification sites is of great significance for the study of basic biological processes and the development of related drugs, yet so far no predictive studies on Pupylation and plant Ubiquitylation proteins have been proposed. In this paper, four models are proposed to predict Pupylation proteins, Pupylation sites, plant Ubiquitylation proteins, and plant Ubiquitylation sites, respectively; the specific research work is described below. For a given protein sequence, the approach first predicts whether it is a Pupylation or Ubiquitylation protein and then, only if it is predicted as such, predicts its specific modification sites. This two-stage procedure can greatly reduce experimental costs and improve work efficiency. To predict Pupylation proteins, a KNN scoring matrix model based on functional domain GO annotation and a Word Embedding model are used to extract features from protein sequences. Because the dataset for these proteins is imbalanced, this paper applies random undersampling (RUS) and the synthetic minority oversampling technique (SMOTE) to balance the feature set. Finally, the balanced dataset is input into the extreme gradient
boosting (XGBoost) classifier, and under 10-fold cross-validation the evaluation metrics are good, which indicates that the model generalizes well. For Pupylation sites, researchers have developed several large-scale proteomic methods to identify Pupylation sites, but many sites remain to be discovered, and sequence-based prediction methods can help find them. In this paper, six feature encoding schemes, namely TPC, AAI, One-Hot, PseAAC, CKSAAP, and Word Embedding, are used to extract features from protein sequence fragments, and the chi-square test is used for feature selection. Under 10-fold cross-validation, the accuracy is significantly higher than that of other existing methods. Finally, to facilitate verification and experiments by other researchers, this paper establishes a prediction website called "PUP-PS-Fuse", which can be accessed at "https://bioinfo.jcu.edu.cn/PUP-PS-Fuse". Regarding plant Ubiquitylation proteins and their modification sites, existing studies of Ubiquitylation sites are cross-species, the underlying patterns are unclear, and prediction methods are species-specific. Therefore, this study considers only Ubiquitylation sites in plants. This paper first constructs a model to identify Ubiquitylation proteins in plants. To better capture protein sequence information, several feature extraction methods are used: a KNN scoring matrix model based on functional domain GO annotation and word embedding models (CBOW and Skip-Gram). The extracted feature vectors are then fed into the Light Gradient Boosting Machine (LGBM) classifier. With 10-fold cross-validation and evaluation on independent datasets, the model achieves good results. This paper then constructs a model for predicting plant Ubiquitylation sites. Skip-Gram, CBOW, and EAAC encodings are used to extract protein sequence fragment
features. After 10-fold cross-validation, the model is compared with existing predictive models on independent datasets. The comparative experiments show that the models in this paper are robust in predicting plant Ubiquitylation proteins and sites and outperform previous prediction tools. The dataset and source code used in this article are freely available at https://github.com/gmywqk/Ub-PS-Fuse. This paper proposes, for the first time, machine-learning-based prediction of Pupylation and plant Ubiquitylation proteins, and further predicts Pupylation and plant Ubiquitylation sites within the predicted proteins. The comparative experiments confirm that machine-learning-based computational methods can predict Pupylation and plant Ubiquitylation proteins and their modification sites with good performance. |
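
Two of the fragment encodings named above, CKSAAP (used for Pupylation sites) and EAAC (used for plant Ubiquitylation sites), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the maximum gap of 2, the window size of 5, the choice to normalize CKSAAP by the number of valid pairs, and the example fragment are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(fragment, max_gap=2):
    """Composition of k-spaced amino acid pairs for gaps 0..max_gap.

    Returns (max_gap + 1) * 400 frequencies. Pairs containing a
    non-standard symbol (for example a padding 'X') are skipped, and
    each gap block is normalized by its number of valid pairs
    (an assumption; other normalizations exist).
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(fragment) - gap - 1):
            pair = fragment[i] + fragment[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


def eaac(fragment, window=5):
    """Enhanced amino acid composition: residue frequencies per sliding window."""
    features = []
    for start in range(len(fragment) - window + 1):
        chunk = fragment[start:start + window]
        features.extend(chunk.count(aa) / window for aa in AMINO_ACIDS)
    return features


# Hypothetical 21-residue fragment centred on a lysine (index 10).
fragment = "MKTAYIAKQRKISFVKSHFSR"
print(len(cksaap(fragment)), len(eaac(fragment)))  # prints: 1200 340
```

The resulting fixed-length vectors are the kind of input that a feature-selection step or a gradient-boosting classifier would consume; the fragment length determines the EAAC dimension, so all fragments would need the same length.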