| Epigenetic modification refers to heritable changes in gene expression that do not alter the DNA sequence. It mainly involves nucleotide modification (DNA and RNA modification) and protein post-translational modification (PTM). More than 600 types of epigenetic modification are currently known, such as DNA methylation, RNA methylation, and protein crotonylation. Accumulating evidence indicates that epigenetic modification regulates various biological processes by controlling gene expression and plays a crucial role in embryonic development, tissue differentiation, and disease development. The study of epigenetic modification is therefore essential for understanding many life phenomena. The primary problem in this field is how to determine modification sites in biomacromolecules with high precision and at large scale. A considerable number of high-throughput technologies have been developed for the detection and identification of modification sites. However, these techniques rely on molecular biology experiments that are costly in labor and materials and are complicated to operate, which limits their application to whole genomes, transcriptomes, and proteomes. Compared with traditional wet-lab experiments, computational bioinformatics methods can compensate for these shortcomings and have broader application prospects in the era of big data. Focusing on epigenetic modification, this dissertation constructs a database of plant epigenetic modification sites and proposes multiple prediction models for modification sites across the genome, transcriptome, and proteome. The main research content of this dissertation is as follows: (1) To support the storage, annotation, and information mining of high-throughput plant epigenetic modification data, a comprehensive database called PlantEMS was constructed to analyze and visualize DNA, RNA, and protein PTM information in
plants. The database collects and curates 4 DNA modification types comprising 12,970,352 modification sites, 26 RNA modification types comprising 23,500 modification sites, and 23 protein PTM types comprising 132,085 modification sites from 51 plants. It thereby meets the demand for convenient and comprehensive analysis tools in this research field. (2) For the identification of DNA modification sites, a multi-species, multi-type integrated prediction framework, iDNA-MS, was developed. In this framework, DNA sequences were encoded by k-tuple nucleotide frequency components, nucleotide chemical properties and nucleotide frequencies, and a mono-nucleotide binary encoding strategy. A random forest was then used to build classifiers for identifying 5-hydroxymethylcytosine, N6-methyladenine, and N4-methylcytosine sites. Cross-validation and independent test results showed that iDNA-MS achieves robust predictive performance in identifying the three modification types in 17 genomes. In addition, we found a potential signal arising from the distinctive nucleotide distribution pattern associated with methyltransferase activity, which helps bridge the gap between DNA modification and chromatin conformation and provides new insight into the mechanism of epigenetic modification. Based on the proposed model, a web server named iDNA-MS was established, which can be freely accessed at http://lin-group.cn/server/iDNA-MS. (3) For RNA modification site prediction, we developed a species-specific model named iRNA-m5C and a tissue-specific model named iRNA-m6A using an optimal feature fusion strategy and machine learning algorithms. Specifically, iRNA-m5C identifies 5-methylcytosine modification sites in Homo sapiens, Mus musculus, Saccharomyces cerevisiae, and Arabidopsis thaliana. iRNA-m6A was designed to identify N6-methyladenosine modification sites in Homo sapiens (brain, liver, and kidney), Mus musculus (brain, liver, heart, testis, and kidney), and
Rattus norvegicus (brain, liver, and kidney). Cross-validation and independent test results demonstrated that iRNA-m5C and iRNA-m6A outperform existing tools. In addition, we found that the natural vector method can represent global sequence order information, address the problem of representing modified sequences in high-dimensional space, and provide key clues for revealing the mechanism of RNA modification at the species and tissue level. Based on the two proposed models, web servers named iRNA-m5C and iRNA-m6A were established, which can be freely accessed at http://lin-group.cn/server/iRNA-m5C and http://lin-group.cn/server/iRNA-m6A, respectively. (4) For the recognition of PTM sites, a model named iRice-MS was developed using a multimodal feature encoding method and the extreme gradient boosting algorithm. The model identifies 2-hydroxyisobutyrylation, crotonylation, malonylation, ubiquitination, succinylation, and acetylation sites in rice. Evaluation on an independent dataset showed that the area under the curve (AUC) of the model exceeded 0.84 for each of the six PTM types, which supports the robustness of the model. In addition, a comparison with other published tools demonstrated the advantage of iRice-MS. Based on the proposed model, a web server named iRice-MS was established, which is freely accessible at http://lin-group.cn/server/iRice-MS. To explore the effectiveness of deep neural networks combined with natural language processing in PTM site prediction, we developed Deep-Kcr, a Kcr site prediction framework based on a convolutional neural network and available at https://github.com/linDing-group/Deep-Kcr. Deep-Kcr combines sequence-based features, physicochemical property-based features, and natural language processing with information-gain feature screening to produce a model that achieves an AUC of 0.885. Comprehensive feature analysis demonstrated the
feasibility of natural language processing in Kcr site prediction for the first time. In addition, we found a synergistic relationship between crotonylation and acetylation, confirmed the feasibility and effectiveness of sequence, physicochemical, and spatial information in characterizing modified sequences, and provided multidimensional methods for understanding PTM mechanisms. To further explore the effectiveness of natural language processing in other PTM site prediction tasks, we proposed DeepIPs, a deep learning prediction architecture for identifying phosphorylation sites in host cells infected with SARS-CoV-2, which can be accessed at https://github.com/linDing-group/DeepIPs. DeepIPs combines currently popular natural language processing methods with a CNN-LSTM network architecture. Evaluation on an independent dataset showed that the supervised embedding layer, a natural language processing-based method, achieves an AUC of 0.887 in identifying S/T phosphorylation sites, while GloVe achieves an AUC of 0.841 in identifying Y phosphorylation sites, which demonstrates the strong performance of natural language processing in identifying phosphorylation sites. In summary, this dissertation presents a systematic study of epigenetic modification. We constructed the first plant epigenetic modification site database, PlantEMS, and developed modification site prediction tools across the genome, transcriptome, and proteome. For the identification of modification sites, we explored the effectiveness of various machine learning algorithms and feature extraction and processing methods, achieved a preliminary level of accurate prediction of epigenetic modification sites, and provided important computational tools and reference information for subsequent experimental research. |
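
As an illustration of two of the sequence encodings named in part (2), k-tuple nucleotide frequency and mono-nucleotide binary encoding, the following is a minimal Python sketch. It is not the iDNA-MS implementation: the function names, the choice of k = 2, and the handling of non-ACGT characters are assumptions made here for illustration, and the nucleotide chemical property encoding is omitted.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences use the four standard DNA bases


def ktuple_frequency(seq, k=2):
    """Normalized frequency of every possible k-tuple, in lexicographic order."""
    seq = seq.upper()
    total = len(seq) - k + 1  # number of overlapping k-tuples in the sequence
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    # Counter returns 0 for tuples that never occur, so the vector has 4**k entries.
    return [counts["".join(t)] / total for t in product(ALPHABET, repeat=k)]


def binary_encoding(seq):
    """One-hot encode each base as four bits (A, C, G, T).

    Assumption: any character outside ACGT is encoded as four zeros.
    """
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == ref else 0 for ref in ALPHABET)
    return vec


def encode(seq, k=2):
    """Concatenate both encodings into one feature vector (hypothetical helper)."""
    return ktuple_frequency(seq, k) + binary_encoding(seq)
```

For fixed-length sequence windows, a vector produced this way could be passed to a classifier such as a random forest. The frequency part always has 4^k entries, while the binary part grows with the window length.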