Enhancers are a class of non-coding DNA sequences located near structural genes. By binding proteins such as transcription factors in the nucleus, they act on promoters to enhance gene transcription during eukaryotic development. Accurate prediction of enhancers and their strength can therefore greatly deepen our understanding of the molecular mechanisms of gene transcription and facilitate subsequent related studies. Traditional biological methods for identifying enhancers are costly and slow. With the development of machine learning, more and more computational methods have been proposed for predicting enhancer sequences, but existing models still need improvement in accuracy and interpretability.

Non-enhancers and enhancers (strong enhancers and weak enhancers) are chosen as the research objects of this paper for two main reasons: enhancers are a current research hotspot closely related to the occurrence of many tumor diseases, and enhancers have many subtypes, the most important of which are strong and weak enhancers. This paper studies the differences between enhancers, their subtypes, and non-enhancer sequences, aiming to improve the prediction accuracy and efficiency of computational models, to provide interpretability for traditionally black-box models, and to offer a more efficient and faster method for enhancer identification in biology. The main research contents are as follows.

An XGBoost-based ensemble learning interpretability model. This model consists of two layers of classifiers: the first layer identifies enhancers and the second layer identifies enhancer strength. It uses XGBoost as the base classifier together with five feature extraction methods (k-Spectrum Profile, Mismatch k-tuple, Subsequence Profile (k-mer), Position-Specific Score Matrix, and Pseudo-dinucleotide Composition). I feed the resulting feature-vector matrices into the ensemble for fusion. The experiment uses the Shapley interpretation approach to provide interpretability and improve the credibility of previously black-box machine learning methods.

A deep learning interpretability model based on a self-attention mechanism, iEnhancer-CLA, to identify enhancers and their strengths. Specifically, iEnhancer-CLA automatically learns one-dimensional sequence features through a multiscale convolutional neural network (CNN) and employs a self-attention mechanism to represent the global features formed by multiple elements (the multibody effect). In particular, the model provides interpretable analysis of sub-base sequences and key signals by decoupling the CNN modules and generating self-attention weights. To avoid the bias of manually set hyperparameters, I construct a Bayesian optimization method to obtain globally optimized hyperparameters for the model. Importantly, the analysis reveals that the distribution of bases in enhancers is uneven, with base G being more abundant, whereas bases are relatively evenly distributed in non-enhancers. This result helps improve prediction performance and thus deepens our understanding of enhancer sequence characteristics.
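To make the two-layer ensemble design described above concrete, the following is a minimal Python sketch, assuming pre-computed feature matrices (one per extraction method) and binary labels for each layer; the placeholder data, the probability-averaging fusion, and the specific XGBoost settings are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np
import xgboost as xgb
import shap

# Placeholder inputs: one feature matrix per extraction method (k-mer, PSSM, etc.),
# layer-1 labels (enhancer = 1 / non-enhancer = 0), layer-2 labels (strong = 1 / weak = 0).
rng = np.random.default_rng(0)
feature_sets = [rng.random((200, 64)) for _ in range(5)]
y_enhancer = rng.integers(0, 2, 200)
y_strength = rng.integers(0, 2, 200)

def train_layer(feature_sets, y):
    """Train one XGBoost base classifier per feature set and fuse them by averaging probabilities."""
    models = [xgb.XGBClassifier(n_estimators=200, max_depth=4,
                                learning_rate=0.1, eval_metric="logloss").fit(X, y)
              for X in feature_sets]
    fused = np.mean([m.predict_proba(X)[:, 1]
                     for m, X in zip(models, feature_sets)], axis=0)
    return models, fused

# Layer 1: enhancer vs. non-enhancer.
layer1_models, p_enhancer = train_layer(feature_sets, y_enhancer)

# Layer 2: strong vs. weak, trained on the sequences the first layer calls enhancers.
mask = p_enhancer > 0.5
layer2_models, _ = train_layer([X[mask] for X in feature_sets], y_strength[mask])

# Shapley values attribute one base classifier's predictions to individual features.
explainer = shap.TreeExplainer(layer1_models[0])
shap_values = explainer.shap_values(feature_sets[0])
```

The per-feature Shapley values can then be summarized (for example, ranked by mean absolute value) to show which sequence-derived features drive the enhancer and strength predictions, which is the interpretability step referred to above.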
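The iEnhancer-CLA architecture described above can be sketched roughly as follows, assuming a Keras implementation with one-hot-encoded 200 bp sequences; the layer sizes, kernel widths, and the use of MultiHeadAttention are illustrative choices rather than the published configuration.

```python
from tensorflow.keras import layers, models

SEQ_LEN = 200  # enhancer benchmark sequences are assumed here to be 200 bp

def build_ienhancer_cla_like(seq_len=SEQ_LEN):
    """One-hot DNA input -> multiscale CNN -> self-attention -> enhancer probability."""
    inputs = layers.Input(shape=(seq_len, 4))  # A/C/G/T one-hot encoding

    # Multiscale convolutions capture sequence motifs of different widths.
    branches = [layers.Conv1D(64, k, padding="same", activation="relu")(inputs)
                for k in (3, 5, 7)]
    x = layers.Concatenate()(branches)
    x = layers.MaxPooling1D(pool_size=2)(x)

    # Self-attention relates distant positions to each other (the multibody effect);
    # its weight matrices can later be inspected for interpretation.
    x = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)

    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

model = build_ienhancer_cla_like()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Inspecting the learned convolution filters and the self-attention weight matrices is what supplies the motif-level and position-level interpretability mentioned above.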
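The Bayesian hyperparameter optimization mentioned above could be set up along the following lines; the keras-tuner package, the searched ranges, and the simplified model are all assumptions made for illustration, not the search space actually used.

```python
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(hp):
    """Small CNN whose width, kernel size, and learning rate are searched."""
    model = models.Sequential([
        layers.Input(shape=(200, 4)),
        layers.Conv1D(hp.Int("filters", 32, 128, step=32),
                      hp.Choice("kernel_size", [3, 5, 7]),
                      padding="same", activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Float("lr", 1e-4, 1e-2, sampling="log")),
        loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                max_trials=20, overwrite=True,
                                directory="bo_search", project_name="enhancer")

# Usage (with real one-hot sequences and labels in place of these placeholders):
# tuner.search(x_train, y_train, validation_split=0.2, epochs=10)
# best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
```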