| Proteins are composed of twenty different amino acids, such as lysine (K). Post-translational modifications (PTMs) of proteins arise from the covalent attachment of chemical groups or small proteins to specific amino acid residues. PTMs dynamically regulate a variety of biological processes, so accurate identification of PTM sites is important for studying their functions. For example, SUMOylation is the covalent attachment of SUMO, a small protein, to a lysine residue of a target protein, and it regulates biological processes such as transcription factor activation and DNA damage repair. PTM sites can be identified by experimental or computational methods. Experimental identification is highly accurate but costly, labor-intensive, and inefficient. Computational prediction is fast and high-throughput, although its accuracy depends closely on the size of the training set and the modeling algorithm. Current SUMOylation site predictors, for instance, mainly use traditional machine learning algorithms and rarely deep learning, and the data used to build them are far smaller than the set of experimentally identified SUMOylation sites now available (~68,000); deep learning prediction models trained on large data sets are lacking. The main work of this thesis is to construct prediction models from the identified SUMOylation site data and to extend the modeling methods, with the aim of developing a general deep learning modeling framework. The work comprises four parts. (1) Evaluation of previously reported SUMOylation predictors on large-scale SUMOylation data reveals poor performance, both in the generalization ability of the models and in the adaptability of the modeling algorithms. Better prediction models therefore need to be developed. (2) Prediction models with different frameworks were constructed and evaluated, and the model based on a residual structure (named ResSUMO) performed best. First, traditional machine learning algorithms were combined with different encoding schemes to develop prediction models, and their performance was evaluated by five-fold cross-validation. Second, a residual network was introduced to address the performance degradation caused by increasing the number of layers in CNN models. The results show that ResSUMO predicts with high accuracy and robustness. (3) Because ResSUMO considers only local contextual features, a hybrid deep learning model based on an attention mechanism (called HDeepSpred) was developed. It extracts both local and full-length information from protein modification sequences. HDeepSpred takes one-hot encoding as input features and combines a residual model based on channel attention and spatial attention with a BiGRU model. HDeepSpred outperformed the other models. (4) A modification site prediction package based on deep learning algorithms (DLKit) was developed. Building prediction models for modification sites generally requires complex processing steps and specialized data science knowledge. To lower these barriers, several general-purpose software packages for modification site prediction have recently been reported. They facilitate rapid development of specific prediction models but rely mostly on traditional machine learning algorithms and do not exploit the advantages of deep learning. We therefore developed DLKit, which builds prediction models based on deep learning algorithms for biological sequences. |
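
The one-hot input encoding mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the flank size, the padding symbol, and the function names `lysine_windows` and `one_hot` are assumptions chosen for illustration. The sketch cuts a fixed-length window around each lysine (K) and maps every residue to a 20-dimensional indicator vector.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # hypothetical padding symbol for positions beyond the sequence ends


def lysine_windows(sequence, flank=3):
    """Yield (position, window) for every lysine (K), padding at the termini.

    `flank` is an assumed parameter: residues kept on each side of K.
    """
    padded = PAD * flank + sequence + PAD * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            # In `padded`, residue i sits at index i + flank, so the
            # window of length 2 * flank + 1 starts at index i.
            yield i, padded[i : i + 2 * flank + 1]


def one_hot(window):
    """Encode a window as a list of 20-dimensional 0/1 vectors.

    Padding and non-standard symbols become all-zero rows.
    """
    return [[1 if residue == aa else 0 for aa in AMINO_ACIDS] for residue in window]


# Toy sequence with lysines at 0-based positions 1 and 7.
windows = list(lysine_windows("MKTAYIAKQR", flank=3))
encoded = [one_hot(window) for _, window in windows]
```

Each encoded window is a matrix of shape (2 * flank + 1) x 20 that a convolutional or recurrent model could take as input. Real predictors typically use much longer flanks than the toy value used here.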