| 5-Methylcytosine methylation is an important epigenetic modification that plays a key role in maintaining life processes and transmitting genetic information. Methylation is commonly studied through its modification sites, so accurate identification of 5-methylcytosine sites in DNA and RNA is of great importance for understanding the mechanism and function of this modification. In recent years, machine learning has been widely used to predict 5-methylcytosine sites. Compared with traditional experimental methods, machine learning methods are faster, cheaper, and highly accurate. In this paper, two different types of prediction models were constructed for 5-methylcytosine sites in DNA and RNA using machine learning and deep learning techniques. The main work and research contents are as follows: (1) For RNA 5-methylcytosine (m5C) site prediction, to address the reliance of existing prediction models on a single feature extraction method and a single classification model, 6 existing sequence-based feature extraction methods and 9 commonly used machine learning algorithms were first evaluated. A feature selection method based on ensemble-learning feature importance was then designed, which uses the feature importance of three different types of ensemble learning models (Random Forest, AdaBoost, and XGBoost) as the basis for feature selection. Finally, a two-layer ensemble learning model based on Stacking was constructed, in which Random Forest, Support Vector Machine (SVM), XGBoost, and LightGBM serve as the base classifiers and Logistic Regression serves as the secondary classifier. Test results on Arabidopsis and mouse datasets show that the proposed feature selection method and Stacking ensemble model effectively improve the prediction accuracy of m5C sites, and that the test accuracy on Arabidopsis exceeds that of the best existing ensemble learning method. (2) For DNA 5-methylcytosine (5mC) site prediction, a 5mC site prediction model named Nanoformer was constructed based on the third-generation sequencing technology Nanopore sequencing and the deep learning algorithm Transformer. To address problems of existing methods such as a complex feature extraction process, insufficient information extraction, and weak model generalization ability, Nanoformer takes the raw electrical signal and DNA sequence from Nanopore sequencing data as input features and encodes them with an attention-based Transformer encoder. It combines this encoder with a bidirectional long short-term memory network (BiLSTM) to extract bidirectional features of the DNA sequence, and feeds the result into a two-layer fully connected network to predict 5mC sites. Prediction results on Arabidopsis and rice data show that Nanoformer accurately distinguishes 5mC sites from non-5mC sites, achieves better prediction performance than existing similar methods, and can perform cross-species prediction. |
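
The importance-based feature selection in (1) can be illustrated with a minimal sketch. The abstract does not say how the importance scores from the three ensemble models are combined, so the rule below (normalize each model's scores to sum to 1, average them, keep the top `k` features) is an assumption made only for illustration. The function names and the importance values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale non-negative importance scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return [s / total for s in scores]


def select_top_features(importances: Dict[str, Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k features with the highest mean importance.

    Assumption: per-model scores are normalized and then averaged; the
    paper's actual combination rule is not given in the abstract.
    """
    if not importances:
        raise ValueError("at least one importance vector is required")
    vectors = [normalize(v) for v in importances.values()]
    n_features = len(vectors[0])
    if any(len(v) != n_features for v in vectors):
        raise ValueError("all models must score the same number of features")
    mean_importance = [
        sum(v[i] for v in vectors) / len(vectors) for i in range(n_features)
    ]
    # Rank feature indices from most to least important, then keep the top k.
    ranked = sorted(range(n_features), key=lambda i: mean_importance[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical importance scores for four features from the three model types.
example_importances = {
    "random_forest": [0.10, 0.40, 0.30, 0.20],
    "adaboost": [0.05, 0.50, 0.25, 0.20],
    "xgboost": [2.0, 6.0, 1.0, 1.0],  # unnormalized gain-style scores
}
selected = select_top_features(example_importances, k=2)
print(selected)
```

Normalizing before averaging keeps a model that reports scores on a larger scale from dominating the ranking. In practice the score vectors would come from fitted models, and the selected feature columns would then be passed to the Stacking classifier.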