The promoter is an essential element of the DNA sequence, usually located near the transcription start site of a gene. It is the starting point at which RNA polymerase begins transcribing a specific gene and therefore plays a crucial role in the regulation of gene transcription. Promoters are also associated with many human diseases and may be a major cause of disease induction. Their importance in molecular biology and genetics has attracted considerable research interest. In addition, understanding how enhancer-promoter interactions (EPIs) regulate the expression of specific genes in cells contributes to the understanding of gene regulation, cell differentiation, and related processes. With the rapid development of high-throughput sequencing technology, the number of available DNA sequences is growing explosively, and traditional wet-lab experimental methods for identifying promoters and EPIs can no longer meet the demand. Therefore, this paper presents a study of E. coli promoters and of EPIs in six human cell lines, based on biological sequence information and machine learning methods. The main research contents are as follows:

(1) For the prediction of E. coli promoters, this paper proposes a prediction model, iProL, based on the Longformer pre-trained model from natural language processing. iProL does not rely on biological prior knowledge and identifies promoters from DNA sequences alone. It also combines a Convolutional Neural Network (CNN) and a Bi-directional Long Short-Term Memory network (BiLSTM) to extract local and global features of DNA sequences. The experimental results show that, compared with the latest published methods, iProL achieves the highest scores on specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC): 86.61%, 85.62%, 0.7130, and 0.9211, respectively. iProL therefore has superior predictive performance and a more balanced recognition of positive and negative samples, which offers the possibility of detecting new promoters.

(2) To address the problem of predicting EPIs in six human cell lines, this paper designs StackEPI, a cell line-specific EPI prediction method based on a stacking ensemble learning strategy, which offers better prediction performance and faster training. Specifically, by combining different encoding schemes and machine learning methods, StackEPI extracts cell line-specific information from enhancer and promoter sequences from multiple perspectives and accurately identifies cell line-specific EPIs. Comparative results show that the model outperforms other state-of-the-art models at identifying cell line-specific EPIs and is also computationally more efficient than other methods.
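
For reference, the threshold-based scores quoted for iProL (Sp, Acc, and MCC) can be computed from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions. The function name `binary_metrics` and the toy labels are hypothetical illustrations, not taken from this work or its data.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return (Sp, Acc, MCC) for binary labels.

    Assumption: 1 marks a positive sample (e.g. promoter) and
    0 marks a negative sample (e.g. non-promoter).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Specificity: fraction of negatives that were correctly rejected.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all samples that were classified correctly.
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sp, acc, mcc


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    sp, acc, mcc = binary_metrics(y_true, y_pred)
    print(f"Sp={sp:.4f} Acc={acc:.4f} MCC={mcc:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it is a common choice for judging whether positive and negative samples are recognized in a balanced way. AUC is omitted here because it requires continuous prediction scores rather than hard labels.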