| Objective: (1) To fully understand the mechanisms underlying transcription factor (TF)-mediated gene regulation, it is critical to accurately identify TF binding sites and predict their binding affinity. Recently, deep learning (DL) algorithms have achieved promising results in predicting DNA-TF binding; however, the various deep learning architectures have not been systematically compared, and their relative merits remain unclear. (2) To gain insight into the biological function of N4-methylcytosine (4mC), it is critical to identify its modification sites in the genome. In contrast to expensive, time-consuming and complex experimental approaches, machine learning-based computational methods have become increasingly popular in recent years. Moreover, as the most advanced branch of machine learning, deep learning algorithms can automatically learn features from DNA sequences without complex feature engineering and are therefore often used for 4mC site identification. However, there is a lack of systematic analysis of how to build predictive models using deep learning techniques, and the predictive performance of many deep learning methods still needs improvement. Methods: (1) Four deep learning architectures, namely deep neural networks (DNNs), convolutional neural networks (CNNs), convolutional-recurrent neural networks (CNN-RNNs), and hybrid neural networks integrating CNNs and DNNs, were combined with different feature encoding methods and applied to the SELEX-seq and HT-SELEX datasets, which cover three species and 35 families; the performance of the four architectures was evaluated and compared using 10-fold cross-validation. The features learned by the hybrid models were explored by examining, for each TF, the 200 sequences with the highest prediction scores in 10-fold cross-validation, and the four motifs learned by the CNN+DNN hybrid models were compared with the corresponding motifs recorded in JASPAR. The results of the CNN+DNN hybrid model were also compared with those of existing state-of-the-art 
methods. (2) We used a standard benchmark dataset (Zeng_2020_1) covering three species (A. thaliana, C. elegans and D. melanogaster), with 20,000 positive and negative samples and a sequence length of 41 nt. First, we trained three deep learning architectures (CNN, RNN, CNN+BiLSTM) with extensive hyperparameter tuning. Then, the optimal model of each architecture was selected for performance comparison on the A. thaliana and D. melanogaster datasets. In addition, we added attention mechanisms to the RNN and CNN-RNN architectures. Further, six encoding methods were designed and applied to the optimal model. Finally, to better explain the "black box" of deep learning, we used two methods, UMAP and Deep SHAP, to visualize and interpret the model. To assess the generalization performance of our method, we repeated the same procedure on another dataset (Zeng_2020_2). Results: (1) For regression prediction of TF-DNA binding affinity, when different deep learning architectures (CNN, CNN-LSTM, CNN-BiLSTM) with the same encoding method (one-hot) were applied to the datasets of the three species, CNN showed the best performance; when k-mer composition features (k=4 or 5) were used with the same architecture (DNN), the DNN performed best at k=5 on the datasets of the three species; the hybrid CNN+DNN model outperformed models built on CNN or DNN alone, with an average Pearson correlation coefficient of 0.912 for Drosophila, 0.930 for human and 0.918 for mouse; and the motifs learned by the hybrid model were almost identical to the corresponding JASPAR motifs. The CNN+DNN model performed significantly better than current state-of-the-art methods on the HT-SELEX dataset. (2) On the Zeng_2020_1 dataset, the CNN model has the best performance when the number of filters, convolutional kernel size, and pooling layer size are 150, 7, and 2, respectively; the RNN model has the best performance when the LSTM size is 128; the CNN-RNN model has the best 
performance when the number of filters, convolutional kernel size, pooling layer size, and LSTM size are 150, 3, 2, and 128, respectively. After the attention mechanism was added, the performance of the RNN and CNN-RNN models improved, and the CNN-RNN_attention model achieved the best performance among all models. In the comparison of encoding methods for the CNN-RNN_attention model, 1-mer_onehot outperformed the other encoding methods in terms of accuracy and MCC. UMAP can be used to visualize the relationship between the learned representations of 4mC and non-4mC sites; the Deep SHAP method identifies sequence features associated with 4mC sites, and the most important features at each position of the input sequence are visualized as a sequence logo. Conclusion: (1) CNN models can effectively extract features directly from DNA sequences for quantifying TF-DNA binding affinity, whereas RNN models fail to effectively capture sequence order information, possibly because the DNA sequences are short; CNN and DNN models learn different DNA features that complement each other in predicting relative TF-DNA binding affinity; our proposed hybrid CNN+DNN model can accurately detect TF binding motifs. (2) After a series of optimizations, the best overall prediction performance is achieved by a convolutional-recurrent neural network architecture using one-hot encoding and an attention mechanism; the UMAP and Deep SHAP methods can effectively explain the "black box" of deep learning methods. |
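
The abstract names two sequence encodings (one-hot and k-mer composition) and two evaluation metrics (the Pearson correlation coefficient and MCC). The following is a minimal sketch of what these could look like, assuming the standard textbook definitions; the function names `one_hot`, `kmer_composition`, `pearson` and `mcc` are hypothetical, and this is not the authors' implementation.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed base order for both encodings


def one_hot(seq):
    """One row per base, columns in A, C, G, T order; unknown bases (e.g. N) give an all-zero row."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


def kmer_composition(seq, k):
    """Normalized frequencies of all 4**k k-mers, listed in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when either input is constant
    return cov / (sd_x * sd_y)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Under these assumptions, a 41 nt sequence becomes a 41 x 4 one-hot matrix, which suits a CNN or RNN input, while a 5-mer composition vector has 4^5 = 1024 entries regardless of sequence length, which suits a fully connected DNN input.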