
Metagenomic Data Binning Method Based On Attention Mechanism Of Deep Learning

Posted on: 2022-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Yang
Full Text: PDF
GTID: 2480306773971339
Subject: Automation Technology
Abstract/Summary:
With the development of metagenomic sequencing technologies, metagenomics has become a practical and valuable way to obtain gene sequences directly from environmental samples. Metagenomic data binning is an important method for identifying the taxonomic composition of the microbes in metagenomic sequences, and it is a typical machine learning clustering or classification problem. Most current machine learning methods use hand-designed gene sequence features, and researchers have proposed a number of effective machine learning models, especially deep learning models, that achieve good results. However, in representing gene sequences, these methods ignore the complex semantic information, the unstructured nature, and the long and variable length of gene sequences. Meanwhile, they consider only the content or only the abundance information of a sequence, which leaves the features incomplete and reduces accuracy. In the clustering step, most clustering algorithms ignore the class imbalance of metagenomic data, leaving considerable room for improvement in clustering performance.

To address these problems, this paper proposes an efficient BERT-based contig representation model, Contig BERT. The model uses a modified small BERT model to represent each contig sequence and obtain its embedding vector. The distribution of the embedding vectors is shown through data visualization, and the sequence embeddings are applied to metagenomic data clustering and taxonomic classification tasks.

For the clustering task, this paper proposes a Contig BERT-based whole-genome sequence recovery tool, Contig BERTRG, which clusters contigs with a centroid-based iterative clustering algorithm. This paper also proposes an algorithm to merge and optimize the clustering results, and compares the model with other methods on a public dataset. The experimental results verify the effectiveness of the Contig BERTRG model. In addition, this paper explores the influence of different feature representations through ablation experiments and analyzes the reasons for the model's better performance.

For the taxonomic classification task, this paper proposes a Contig BERT-based taxonomic classification tool, Contig BERTTC, which provides three neural network classifiers: a feedforward neural network-based classifier, a Transformer-based classifier, and a convolutional neural network-based classifier. To mitigate the class imbalance problem, this paper uses the focal loss strategy and achieves better results than other tools on public datasets, validating the effectiveness of the method.

In summary, the work of this paper covers two main aspects: (1) A language model based on self-supervised learning is used for representation learning on unstructured contigs, and the Contig BERT model is constructed, which addresses the feature extraction problem that machine learning faces on unstructured metagenomic data. (2) A method combining a centroid-based iterative clustering algorithm with a merging optimization strategy is proposed, which obtains better results than existing metagenomic binning tools. In this paper, language models are successfully introduced into metagenomic binning tasks, and the Contig BERTRG and Contig BERTTC tools are constructed for whole-genome sequence recovery and taxonomic classification respectively, providing a set of effective and robust solutions for metagenomic binning tasks.
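The abstract says a focal loss strategy is used against class imbalance but gives no formula or hyperparameters. As a rough illustration only, the sketch below uses the commonly cited form of focal loss, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The function names, the values gamma = 2.0 and alpha = 1.0, and the toy logits are all assumptions made for this example; they are not taken from the thesis, whose exact variant may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample (gamma and alpha are assumed values).

    p_t is the predicted probability of the true class. The factor
    (1 - p_t) ** gamma shrinks the loss of well-classified samples,
    so hard samples contribute relatively more to the total loss.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy, which equals focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0, alpha=1.0)


# Hypothetical logits for a 3-class problem; the true class is 0 in both.
easy_sample = [4.0, 0.0, 0.0]  # confidently correct prediction
hard_sample = [0.0, 1.0, 0.0]  # true class receives low probability

for name, logits in (("easy", easy_sample), ("hard", hard_sample)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.6f}")
```

With these toy inputs, the confidently classified sample is scaled down far more than the hard one, which is the property that makes this kind of loss a common choice when a few abundant classes would otherwise dominate training.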
Keywords/Search Tags:Attention mechanisms, Natural language models, Representation learning, Metagenomic data binning