Enhancers are a class of non-coding DNA sequences located near structural genes. By binding proteins such as transcription factors in the nucleus, they act on promoters to enhance gene transcription during eukaryotic development. Accurate prediction of enhancers and their strength can therefore greatly deepen our understanding of the molecular mechanisms of gene transcription and facilitate subsequent related studies. Traditional biological methods for identifying enhancers are costly and slow. With the development of machine learning, more and more computational methods have been proposed for predicting enhancer sequences, but existing models still need improvement in accuracy and interpretability.

Non-enhancers and enhancers (strong enhancers and weak enhancers) are chosen as the research objects of this paper for two main reasons: enhancers are a current research hotspot closely related to the occurrence of many tumor diseases, and enhancers have many subtypes, the most important of which are strong and weak enhancers. This paper studies the differences between enhancers, their subtypes, and non-enhancer sequences, aiming to improve the prediction accuracy and efficiency of computational models, to provide interpretability for traditionally black-box models, and to offer a more efficient and faster method for enhancer identification in biology. The main research contents are as follows.

An XGBoost-based ensemble learning interpretability model. This model consists of two layers of classifiers: the first layer identifies enhancers and the second layer identifies enhancer strength. It uses XGBoost as the base classifier together with five feature extraction methods (k-Spectrum Profile, Mismatch k-tuple, Subsequence Profile (k-mer), Position-Specific Score Matrix, and Pseudo-dinucleotide Composition). I feed the resulting feature-vector matrices into the ensemble for fusion. The experiment uses the Shapley interpretation approach to provide interpretability and improve the credibility of previously black-box machine learning methods.

A deep learning interpretability model based on a self-attention mechanism, iEnhancer-CLA, to identify enhancers and their strengths. Specifically, iEnhancer-CLA automatically learns one-dimensional sequence features through a multiscale convolutional neural network (CNN) and employs a self-attention mechanism to represent the global features formed by multiple elements (the multibody effect). In particular, the model provides interpretable analysis of sub-base sequences and key signals by decoupling the CNN modules and generating self-attention weights. To avoid the bias of manually set hyperparameters, I construct a Bayesian optimization method to obtain globally optimized hyperparameters for the model. Importantly, the analysis reveals that the distribution of bases in enhancers is uneven, with base G being more abundant, whereas bases are relatively evenly distributed in non-enhancers. This result helps improve prediction performance and thus deepens our understanding of enhancer sequence characteristics.
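To make the two-layer ensemble design described above concrete, the following is a minimal Python sketch, assuming pre-computed feature matrices (one per extraction method) and binary labels for each layer; the placeholder data, the probability-averaging fusion, and the specific XGBoost settings are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np
import xgboost as xgb
import shap

# Placeholder inputs: one feature matrix per extraction method (k-mer, PSSM, etc.),
# layer-1 labels (enhancer = 1 / non-enhancer = 0), layer-2 labels (strong = 1 / weak = 0).
rng = np.random.default_rng(0)
feature_sets = [rng.random((200, 64)) for _ in range(5)]
y_enhancer = rng.integers(0, 2, 200)
y_strength = rng.integers(0, 2, 200)

def train_layer(feature_sets, y):
    """Train one XGBoost base classifier per feature set and fuse them by averaging probabilities."""
    models = [xgb.XGBClassifier(n_estimators=200, max_depth=4,
                                learning_rate=0.1, eval_metric="logloss").fit(X, y)
              for X in feature_sets]
    fused = np.mean([m.predict_proba(X)[:, 1]
                     for m, X in zip(models, feature_sets)], axis=0)
    return models, fused

# Layer 1: enhancer vs. non-enhancer.
layer1_models, p_enhancer = train_layer(feature_sets, y_enhancer)

# Layer 2: strong vs. weak, trained on the sequences the first layer calls enhancers.
mask = p_enhancer > 0.5
layer2_models, _ = train_layer([X[mask] for X in feature_sets], y_strength[mask])

# Shapley values attribute one base classifier's predictions to individual features.
explainer = shap.TreeExplainer(layer1_models[0])
shap_values = explainer.shap_values(feature_sets[0])
```

The per-feature Shapley values can then be summarized (for example, ranked by mean absolute value) to show which sequence-derived features drive the enhancer and strength predictions, which is the interpretability step referred to above.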
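The iEnhancer-CLA architecture described above can be sketched roughly as follows, assuming a Keras implementation with one-hot-encoded 200 bp sequences; the layer sizes, kernel widths, and the use of MultiHeadAttention are illustrative choices rather than the published configuration.

```python
from tensorflow.keras import layers, models

SEQ_LEN = 200  # enhancer benchmark sequences are assumed here to be 200 bp

def build_ienhancer_cla_like(seq_len=SEQ_LEN):
    """One-hot DNA input -> multiscale CNN -> self-attention -> enhancer probability."""
    inputs = layers.Input(shape=(seq_len, 4))  # A/C/G/T one-hot encoding

    # Multiscale convolutions capture sequence motifs of different widths.
    branches = [layers.Conv1D(64, k, padding="same", activation="relu")(inputs)
                for k in (3, 5, 7)]
    x = layers.Concatenate()(branches)
    x = layers.MaxPooling1D(pool_size=2)(x)

    # Self-attention relates distant positions to each other (the multibody effect);
    # its weight matrices can later be inspected for interpretation.
    x = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)

    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

model = build_ienhancer_cla_like()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Inspecting the learned convolution filters and the self-attention weight matrices is what supplies the motif-level and position-level interpretability mentioned above.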
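The Bayesian hyperparameter optimization mentioned above could be set up along the following lines; the keras-tuner package, the searched ranges, and the simplified model are all assumptions made for illustration, not the search space actually used.

```python
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(hp):
    """Small CNN whose width, kernel size, and learning rate are searched."""
    model = models.Sequential([
        layers.Input(shape=(200, 4)),
        layers.Conv1D(hp.Int("filters", 32, 128, step=32),
                      hp.Choice("kernel_size", [3, 5, 7]),
                      padding="same", activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Float("lr", 1e-4, 1e-2, sampling="log")),
        loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                max_trials=20, overwrite=True,
                                directory="bo_search", project_name="enhancer")

# Usage (with real one-hot sequences and labels in place of these placeholders):
# tuner.search(x_train, y_train, validation_split=0.2, epochs=10)
# best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
```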