Prediction Of Enhancers And N4 Methylation Sites Based On Ensemble Learning And Deep Learning

Posted on: 2023-10-27 | Degree: Master | Type: Thesis
Country: China | Candidate: Q T Geng | Full Text: PDF
GTID: 2530306617470604 | Subject: Control engineering
Abstract/Summary:
Gene expression is the process of converting genetic information into functional gene products, and it is strictly regulated in time and space. Accurately identifying DNA loci or fragments that play specific roles in regulating gene expression contributes to understanding the mechanism of gene expression at the transcriptional level. Enhancers are classical activating elements that bind transcription factors and chromatin modifiers at specific sites, affecting gene expression and the tissue specificity of cell growth and differentiation. Identifying enhancers is challenging because they do not encode protein sequences and are randomly distributed across the roughly 98% of the human genome that is non-coding. Accurate identification of enhancers can help decipher the transcriptional processes of structural genes and contribute to understanding the mechanisms of cell life. N4-methylcytosine (4mC) is an essential DNA modification that influences the regulation of gene expression by protecting the integrity of the host genome. Accurate identification of 4mC sites can help explain the expression mechanisms of disease genes and support gene drug design. Using machine learning, this thesis studies the identification of enhancers and their types and the identification of N4 methylation sites. The main contents are summarized as follows.

(1) Based on ensemble learning, a new enhancer prediction method is proposed that uses a decision tree as the base classifier. First, the sequences are encoded with a multi-source feature extraction strategy. Because some features may be redundant or irrelevant, a more efficient feature matrix is obtained with Recursive Feature Elimination and SelectKBest. Then, the AdaBoost and Bagging ensemble algorithms are each applied to predict enhancer subsequences. The experimental results show that the ensemble prediction framework achieves better accuracy and sensitivity than individual decision trees, and that the feature space obtained through feature selection is more concise and efficient, which reduces computational complexity.

(2) Given that biological sequences resemble sentences in natural language, a framework for recognizing enhancers and their types is developed based on FastText word embedding and deep learning models. Because the enhancer datasets are small, a RankGAN network is employed to expand the non-enhancer, strong-enhancer, and weak-enhancer data. The sequences are then segmented into words with a sliding window. The segmented sequences are fed into the FastText model for training to obtain a distributed representation of the biological words. Finally, a deep neural network combining Long Short-Term Memory (LSTM) and a Convolutional Neural Network (CNN) performs the recognition task. The experimental results show that the enhancer subsequences generated by the RankGAN network have a nucleotide content similar to that of the original enhancer subsequences, which supports the effectiveness of the sequence generation. The proposed method achieves better results than existing methods on both the cross-validation set and the independent test set.

(3) An ensemble learning framework for identifying 4mC sites is proposed that uses multiple machine learning algorithms as base classifiers and combines them by weighted averaging. First, feature extraction algorithms such as Z-Curve, PP, gcContent, atgcRatio, cumulativeSkew, and pseudoKNC are used to mine information from mouse DNA sequences. Then, redundant features are removed with Recursive Feature Elimination and XGBoost to improve the efficiency of the predictor. The prediction performance of multiple machine learning algorithms is compared on an independent test set. The better-performing algorithms are selected as base classifiers, and an ensemble learning framework is constructed by weighted averaging. The experimental results show that the ensemble learning framework has better prediction ability than the base classifiers. Because the overall correlation information contained in the biological sequences is ignored, there is still room to improve the prediction sensitivity of this framework.

(4) A 4mC site recognition framework is proposed based on word embedding, nucleotide chemical property (NCP) encoding, and capsule neural networks. First, biological sequences are divided into nucleotide words using a sliding-window word splitting approach. Next, a word embedding model is trained to obtain the distributed representation of the nucleotide words. To extract correlation information around the methylation sites and high-level abstract features more efficiently, a capsule neural network is chosen to perform the recognition task. The experimental results show that building the feature matrix from the word2vec embedding model and NCP, with the capsule neural network as the classifier, gives better results, and that this model outperforms previous methods on the independent test set.
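The encoding and combination steps named in items (2) through (4) can be illustrated with a minimal sketch. It is not the thesis implementation: the word length `k=3`, the stride of 1, the function names, and the example weights are assumptions made for illustration, and the NCP table follows the commonly used (ring structure, functional group, hydrogen bonding) convention, which the abstract does not spell out.

```python
from typing import Dict, List, Sequence, Tuple

# Commonly used NCP convention (assumed here, not stated in the abstract):
# (ring structure, functional group, hydrogen bonding) per nucleotide.
NCP_TABLE: Dict[str, Tuple[int, int, int]] = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def split_into_words(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Slide a window of width k over the sequence and return the k-mer 'words'."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def ncp_encode(sequence: str) -> List[Tuple[int, int, int]]:
    """Map each nucleotide to its 3-bit NCP vector; unknown symbols become (0, 0, 0)."""
    return [NCP_TABLE.get(base, (0, 0, 0)) for base in sequence.upper()]


def weighted_average(probabilities: Sequence[float], weights: Sequence[float]) -> float:
    """Combine the positive-class probabilities of several base classifiers."""
    if not weights or len(probabilities) != len(weights):
        raise ValueError("need one weight per probability")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total


if __name__ == "__main__":
    print(split_into_words("ACGTAC"))  # ['ACG', 'CGT', 'GTA', 'TAC']
    print(ncp_encode("ACGT"))          # [(1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 1)]
    # Hypothetical outputs of three base classifiers, with hypothetical weights.
    print(weighted_average([0.9, 0.6, 0.3], [2.0, 1.0, 1.0]))
```

In a pipeline like the ones described, the k-mer words would be passed to a word embedding model such as FastText or word2vec, and the weights would typically be chosen from each base classifier's validation performance; both choices are left open here.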
Keywords/Search Tags: Ensemble learning, Deep learning, Word embedding, Enhancer, N4-methylcytosine site