
The Binning Of Metagenomic Sequence Based On Statistical Model And Word Embedding

Posted on: 2019-11-04 Degree: Master Type: Thesis
Country: China Candidate: K Wang Full Text: PDF
GTID: 2370330545483711 Subject: Systems Engineering
Abstract/Summary:
With the development of high-throughput sequencing, metagenomics has become an important methodology for studying microbial communities. Metagenomics extracts genetic material directly from an environmental sample, which contains sequences from many different species. The diversity and complexity of microbial communities make it difficult to investigate the taxonomic structure of a metagenomic sample. Previous studies of taxonomic structure in metagenomics follow two main strategies: 'taxonomy-dependent' supervised classification and 'taxonomy-independent' unsupervised clustering. However, taxonomy-dependent methods rely on sequence alignment and therefore need reference databases to map contigs or reads to meaningful taxa. We therefore explored the taxonomic structure of microbial samples with taxonomy-independent unsupervised clustering, which is also referred to as binning.

The k-mer frequency vector is one representation of sequence composition, and it provides critical information for binning metagenomic sequences. Binning by k-mer composition rests on the observation that relative sequence composition is similar across different regions of the same genome but differs between distinct genomes. Metagenomic sequences can therefore be binned using the dissimilarity matrix between their k-mer frequency vectors, and the choice of dissimilarity measure has a significant impact on the binning result. In our study, we used a statistical model and a word embedding method to further investigate the taxonomic structure of metagenomes. Experiments showed that our pipeline obtained better binning results. The work of this thesis covers the following two aspects.

(1) We modeled contigs by their k-mer composition under a Markov background model and then measured the dissimilarity between contigs using d2s. The d2s dissimilarity matrix was used to bin contigs with an unsupervised clustering algorithm. On this basis, we developed a package named d2s Bin, which adjusts contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-mer composition and the Markov models of the background genomes. Our experiments demonstrate that d2s Bin significantly improves binning performance on 6 datasets with 5 binning tools.

(2) We applied distributed representations and word embedding from natural language processing to the binning of metagenomic sequences. We treat the assembled sequences as sentences and their k-mers as words. The word embedding software word2vec was used to train the k-mer representation vectors, and we selected the Weighted Removal (WR) method to obtain the sentence embedding of each contig. In this way, a metagenomic sequence can be represented as a low-dimensional, dense vector, which performs well in contig binning, data compression and dissimilarity measurement.
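To make the comparison in aspect (1) concrete, the following is a minimal Python sketch of a d2s-style dissimilarity between two contigs. It is not the thesis code. It assumes the form of the measure commonly given in the alignment-free comparison literature: k-mer counts are centered by their expectation under the background model, and the normalized sum is mapped to the range 0 to 1. For brevity the background here is an order-zero Markov (independent nucleotide) model estimated from each contig itself, and only one strand is counted. The abstract does not state the model order, the strand handling or the value of k, so those choices and all function names are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G and T."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(ALPHABET):
            counts[word] += 1
    return counts


def centered_counts(seq, k):
    """Return count minus expected count for every possible k-mer.

    Assumption: the expectation comes from an order-zero Markov background
    whose nucleotide frequencies are estimated from the sequence itself.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    bases = Counter(c for c in seq if c in ALPHABET)
    n_bases = sum(bases.values())
    if n_bases == 0:
        raise ValueError("sequence contains no A, C, G or T")
    freq = {b: bases[b] / n_bases for b in ALPHABET}
    centered = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        prob = 1.0
        for b in word:
            prob *= freq[b]
        centered[word] = counts[word] - total * prob
    return centered


def d2s(seq_x, seq_y, k=4):
    """d2s-style dissimilarity in [0, 1]; 0 means identical centered profiles."""
    x = centered_counts(seq_x, k)
    y = centered_counts(seq_y, k)
    cross = norm_x = norm_y = 0.0
    for word in x:
        denom = sqrt(x[word] ** 2 + y[word] ** 2)
        if denom == 0.0:
            continue  # this k-mer carries no signal in either sequence
        cross += x[word] * y[word] / denom
        norm_x += x[word] ** 2 / denom
        norm_y += y[word] ** 2 / denom
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.5  # arbitrary convention when one profile equals its background
    return 0.5 * (1.0 - cross / (sqrt(norm_x) * sqrt(norm_y)))
```

Calling `d2s` on every pair of contigs would give a dissimilarity matrix of the kind the abstract passes to an unsupervised clustering algorithm. A higher-order Markov background would replace only the probability computation inside `centered_counts`.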
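The embedding step in aspect (2) can be sketched in the same hedged way. The code below assumes that WR means frequency-weighted averaging of word vectors followed by removal of each embedding's projection onto the first principal direction of all embeddings. The abstract names the method but does not describe it, so this reading is an assumption. The k-mer vectors are taken as given (in the thesis they come from word2vec). The smoothing parameter `a`, the power-iteration helper and all function names are illustrative rather than taken from the source.

```python
from collections import Counter
from math import sqrt


def kmers(seq, k):
    """Split a contig ("sentence") into overlapping k-mers ("words")."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def weighted_average(words, vectors, word_prob, a=1e-3):
    """Average the word vectors with weight a / (a + p(word)).

    Frequent k-mers get smaller weights; words without a vector are skipped.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    used = 0
    for w in words:
        if w not in vectors:
            continue
        weight = a / (a + word_prob.get(w, 0.0))
        for j, value in enumerate(vectors[w]):
            total[j] += weight * value
        used += 1
    return [t / used for t in total] if used else total


def first_direction(rows, iterations=100):
    """Approximate the dominant right singular vector by power iteration."""
    dim = len(rows[0])
    u = [1.0 / sqrt(dim)] * dim
    for _ in range(iterations):
        # Multiply u by rows^T * rows, then renormalize to unit length.
        proj = [sum(r[j] * u[j] for j in range(dim)) for r in rows]
        v = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
        norm = sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        u = [x / norm for x in v]
    return u


def wr_embeddings(contigs, vectors, k, a=1e-3):
    """Weighted average per contig, then remove the common direction."""
    tokenized = [kmers(c, k) for c in contigs]
    counts = Counter(w for words in tokenized for w in words)
    n = sum(counts.values())
    word_prob = {w: c / n for w, c in counts.items()}
    averaged = [weighted_average(ws, vectors, word_prob, a) for ws in tokenized]
    u = first_direction(averaged)
    result = []
    for v in averaged:
        dot = sum(x * y for x, y in zip(v, u))
        result.append([x - dot * y for x, y in zip(v, u)])
    return result
```

The returned vectors have the same dimension as the k-mer vectors regardless of contig length, which is what allows contigs to be compared or clustered in a low-dimensional space.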
Keywords/Search Tags: Metagenomics, Contig binning, d2s dissimilarity, Word embedding