| DNA methylation plays an important role in developmental and physiological processes that are associated with many human diseases, and accurate prediction of methylation sites can help to elucidate the mechanisms of these processes and the pathogenesis of related diseases. Although DNA methylation sites can now be identified experimentally at single-base resolution, experimental methods are time-consuming and costly, so computational methods are of increasing interest to researchers. Most existing computational methods rely on manually extracted features, which depend on expert experience and do not fully represent the biological information contained in DNA sequences, so their predictive power leaves room for improvement. In this thesis, we propose to improve predictive performance by representing DNA sequences with a Transformer-based model, Bidirectional Encoder Representations from Transformers (BERT), so that the model captures sequence information more effectively. A promoter-specific BERT model, Promoter-BERT, was pre-trained on promoter sequences of the human genome, and 5mC-related training data were used to fine-tune Promoter-BERT into a model for predicting 5mC sites. During pre-training, 1-mers, 3-mers, and 5-mers were used as alternative word lengths to train separate pre-trained models, and the optimal pre-trained model was selected by fine-tuning. Further, the embeddings of the Promoter-BERT model and of the fine-tuned model were extracted as two features, and NPF features were extracted in addition to these two embeddings. The four groups of features were fed into four learning algorithms to build models, and five-fold cross-validation was used to evaluate and compare all models and select the optimal one; the model based on the extracted embedding features was then compared
with the fine-tuned model to select the optimal model, yielding a new method for predicting 5mC sites, named BERT-5mC; finally, BERT-5mC was compared with existing 5mC site prediction methods. Fine-tuning with Promoter-BERT outperformed fine-tuning with DNABERT on 17 datasets covering 4mC, 6mA, and 5hmC. On the independent test sets, the AUROC values obtained by fine-tuning Promoter-BERT were higher than those obtained by fine-tuning DNABERT on 16 of these datasets; compared with other existing prediction methods, again based on independent-test AUROC, the model of this study achieved higher values on two 5hmC datasets, two 4mC datasets, and seven 6mA datasets, the best results on those datasets. When fine-tuning Promoter-BERT on 5mC data, the best prediction performance was achieved with k = 3, with five-fold cross-validated AUROC, AUPRC, MCC, and ACC of 0.966, 0.602, 0.653, and 0.932, respectively. When BERT embeddings were used as features, the best performance was achieved with the embedding of the fine-tuned Promoter-BERT model and the FFNN algorithm, with five-fold cross-validated AUROC, AUPRC, MCC, and ACC of 0.964, 0.5696, 0.645, and 0.92, respectively. Comparing the embedding-based model with the fine-tuned model showed that the fine-tuned model was superior, so it was chosen as the final model, BERT-5mC. On the independent test set, BERT-5mC achieved ACC, MCC, and AUROC of 0.933, 0.656, and 0.966, respectively, better than other existing 5mC predictors. BERT models thus achieve good results in DNA methylation site prediction, giving researchers a more effective means of understanding the role of DNA methylation. Increasing the number of fine-tuning iterations can improve the performance of a BERT model fine-tuned on small-sample data. This work provides a reference for the development of BERT-based DNA methylation prediction methods. |
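
The abstract describes using 1-mers, 3-mers, and 5-mers as the "word" lengths for pre-training. As a minimal sketch of what such a tokenization might look like, the snippet below splits a DNA sequence into overlapping k-mers. The function name `kmer_tokenize`, the stride of 1, and the toy sequence are assumptions made for illustration only; the abstract does not state how Promoter-BERT actually builds its vocabulary or handles sequence ends.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int) -> List[str]:
    """Split a DNA sequence into overlapping k-mers.

    Assumption: a sliding window with stride 1, so a sequence of
    length n yields n - k + 1 tokens (none if n < k).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    toy_sequence = "ACGTAC"  # hypothetical example, not thesis data
    for k in (1, 3, 5):  # the three word lengths named in the abstract
        print(k, kmer_tokenize(toy_sequence, k))
```

With a stride of 1, larger k gives each token more local context but grows the vocabulary (up to 4^k distinct tokens over A, C, G, T), which is one plausible reason to compare several values of k by fine-tuning performance.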