| Enhancers and promoters are important regulatory elements that control gene expression. Enhancers are DNA sequences, typically located upstream of gene promoters, that enhance or suppress gene expression by binding transcription factors, while promoters are DNA regions that control transcription initiation by binding RNA polymerase. Enhancer-promoter interactions (EPIs) play a critical role in determining the spatial and temporal specificity of gene expression. However, determining EPIs experimentally is time-consuming and expensive, so computational prediction has become an attractive alternative. In this study, a computational framework for predicting EPIs with the random forest algorithm is proposed from two perspectives: sequence and epigenomic signals. From the sequence perspective, the KIR model was developed. This model extracts sequence features with the k-mer method, obtains the optimal feature set through IG feature selection, and builds a prediction model based on random forest. On an independent test set in the GM12878 cell line, the model achieved precision, accuracy (ACC), AUC, and AUPRC of 0.792, 0.859, 0.836, and 0.654, respectively, outperforming the EPIVAN model, which is built on a deep neural network. The KIR model can therefore effectively predict EPIs in the GM12878 cell line. From the epigenomic signal perspective, the HARD model was developed. This model takes ATAC-seq, H3K27ac, RAD21, and distance as feature inputs and uses deepTools to extract the feature vectors related to epigenomic information. The distance feature was obtained by calculating the number of base pairs from the midpoint of the enhancer region to the midpoint of the promoter region. The final feature matrix was obtained by concatenating the epigenomic feature vector with the one-dimensional distance 
feature vector. Finally, the model was built based on random forest. The HARD model performed well on an independent test set in the GM12878 cell line, with precision, ACC, AUC, and AUPRC of 0.799, 0.887, 0.919, and 0.773, respectively, outperforming the KIR, EPIVAN, and RF(10) models. Further analysis showed that the HARD model also performed well in predicting EPIs in the HeLa cell line, with precision, ACC, AUC, and AUPRC of 0.660, 0.836, 0.831, and 0.601, respectively. These results indicate that chromatin accessibility and cohesin binding are important for EPIs, and that the HARD model trained on the GM12878 cell line can be applied to EPI prediction in other cell lines. This study provides a scalable method for predicting enhancer-promoter interactions, which can facilitate the identification of gene regulatory networks and improve understanding of how gene expression is controlled. |
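
The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the value of k, whether k-mer counts are normalized, how IG is computed, or whether the distance is signed, so the following are assumptions made only for illustration: k = 3 by default, counts normalized to frequencies, IG read as information gain and evaluated for a single threshold split, and an absolute midpoint distance. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_vector(seq, k=3):
    """Normalized k-mer frequencies in lexicographic order over A, C, G, T.

    Assumption: k and the normalization are illustrative choices.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing symbols other than A/C/G/T (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0


def information_gain(values, labels, threshold):
    """Information gain from splitting one feature at a threshold.

    Assumption: a single binary split stands in for the IG selection step.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - conditional


def midpoint_distance(enh_start, enh_end, prom_start, prom_end):
    """Base pairs between enhancer and promoter midpoints (absolute value assumed)."""
    enh_mid = (enh_start + enh_end) // 2
    prom_mid = (prom_start + prom_end) // 2
    return abs(enh_mid - prom_mid)
```

In a pipeline of this kind, features would be ranked by information gain, the top-ranked ones kept, and the resulting vectors (or, for the epigenomic model, the signal vector concatenated with the distance value) passed to a random forest classifier.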