| Splicing is a key step in the transcription of DNA into RNA and tightly regulates the transmission of genetic information. Splice sites are the targets recognized during splicing. Research on splice sites not only helps to elucidate the complex splicing mechanism and to enrich and improve DNA sequence annotation, but also lays the foundation for downstream RNA analysis. Splice site identification is an active and difficult problem in transcriptome research, yet existing identification methods have many shortcomings. Traditional machine-learning-based prediction methods rely on experts to extract and select features manually, use excessively high-dimensional feature inputs, and fit sample data slowly. Researchers who apply deep learning, on the other hand, often treat the network as a black box: the models do not assign high weights to important features, the results are not analyzed for interpretability, and it is difficult to improve performance while maintaining reliability. To address these deficiencies, we develop and implement an interpretable splice site prediction method that not only achieves high prediction accuracy but also explains the correlation between the model's predictions and the input sample sequences, providing a basis for further understanding of the splicing mechanism. The main innovations of this paper are as follows: (1) We propose SSP-Deep (Splice Sites Prediction based on Deep learning), a splice site prediction method based on a convolutional neural network combined with an attention mechanism. The method one-hot encodes splice site sequences of 602 bases, uses a convolutional neural network as the main framework, and integrates the CBAM attention mechanism to learn the sample features fully and improve prediction accuracy. To demonstrate the generalizability of the method, we used SSP-Deep to construct donor and 
acceptor splice site models for five species: Homo sapiens, Arabidopsis thaliana, Oryza sativa japonica, Drosophila melanogaster, and Caenorhabditis elegans. The experimental results show that, compared with existing prediction methods, SSP-Deep achieves the best results on many metrics, including accuracy, specificity, sensitivity, F-score, and area under the receiver operating characteristic curve. SSP-Deep also generalizes well, which can help predict splice sites in species with insufficient training samples. (2) We design and implement a Grad-CAM-based interpretability analysis method for splice site prediction models. The method uses Grad-CAM to draw a weight-score curve that visualizes the direct correlation between the sample sequences and the prediction results. We use it to explain the SSP-Deep model proposed in this paper. By analyzing and summarizing how different regions of the subsequences upstream and downstream of the splice site contribute to splicing, we identify the basis for the model’s excellent performance and verify that SSP-Deep does learn valid information from the sample sequences. We also find similar regularities in splice site sequences across species, which supports the conservation of the splicing mechanism. In summary, SSP-Deep advances splice site prediction for gene sequences, and the proposed interpretation scheme gives researchers an effective basis for analyzing and inferring the functions and characteristics of splice sites more conveniently. |
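
The two computational steps named in the abstract, one-hot encoding of the input sequence and the Grad-CAM weight-score curve, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `one_hot_encode` and `grad_cam_1d` are hypothetical, the A/C/G/T channel order and the all-zero encoding of ambiguous bases are assumptions, and the scoring follows the standard Grad-CAM formulation (channel weights are the position-averaged gradients, followed by a ReLU over the weighted sum) adapted here to one-dimensional feature maps. In practice the activations and gradients would come from a trained network; the values below are toy numbers.

```python
BASES = "ACGT"  # assumed channel order; the abstract does not specify one


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:  # assumption: ambiguous bases such as N stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded


def grad_cam_1d(activations, gradients):
    """Compute a Grad-CAM score per sequence position.

    activations: K feature maps from a convolutional layer, each of length L.
    gradients:   gradients of the class score w.r.t. those maps, same shape.
    """
    # Channel weight = gradient averaged over all positions of that channel.
    weights = [sum(channel) / len(channel) for channel in gradients]
    n_positions = len(activations[0])
    scores = []
    for pos in range(n_positions):
        total = sum(w * channel[pos] for w, channel in zip(weights, activations))
        scores.append(max(total, 0.0))  # ReLU keeps only positive evidence
    return scores


if __name__ == "__main__":
    # A 602-base input would yield a 602 x 4 matrix; a short toy sequence is used here.
    print(one_hot_encode("ACGTN"))

    toy_activations = [[1.0, 2.0, 0.0], [0.5, 0.0, 1.0]]
    toy_gradients = [[0.5, 0.25, 0.0], [-1.0, 0.0, -0.5]]
    print(grad_cam_1d(toy_activations, toy_gradients))
```

Plotting the returned scores against sequence position gives a curve of the kind the abstract describes, where higher values mark positions that contributed more to the predicted class.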