
Research on Sequence-Feature-Based Metagenomic Data Analysis Methods

Posted on: 2017-02-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Ding
Full Text: PDF
GTID: 1220330491462034
Subject: Biomedical engineering
Abstract/Summary:
A metagenome, also known as an environmental genome, is the collection of all microbial genomes in an environment; metagenomics is accordingly the study of genetic material recovered directly from environmental samples. Because less than 1% of microbes can be cultured independently, metagenomics has gradually become a mainstream method of microbial research. Its primary task is to analyse the species diversity of microbial communities and, further, to investigate what functions the constituent microbes have and how they perform them. Another important task is to address biomedical problems by studying the differences between metagenomic samples extracted from different microbial communities. Sequence features are among the most important representations of a DNA sequence, and microbial genomes and gene sequences can be distinguished accurately by them. High-throughput sequencing technology has accelerated large-scale metagenomic projects, yielding massive amounts of metagenomic sequencing data, so efficient and accurate metagenomic data analysis methods are urgently needed.
Metagenomic data analysis methods are generally divided into two classes: alignment-based methods and sequence-feature-based (also called alignment-free) methods. Alignment-based methods map sequences to databases with alignment software, and their accuracy depends on the completeness of those databases. Sequence features, in contrast, are extracted directly from the sequence and capture the composition and correlation information within the genomic sequence. They can therefore distinguish different genomes and genomic functional components specifically.
Compared with alignment-based methods, sequence-feature-based methods are independent of microbial databases; they not only analyse metagenomic data accurately but can also explore unknown information. In this dissertation, a sequence feature model that can distinguish microbial species is proposed based on the characteristics of genomic sequences, and metagenomic data analysis methods built on this model are investigated. The detailed contributions of this work are summarized as follows.
1. Sequence statistical features can be divided into sequence composition and sequence correlation according to the statistical strategy used. Sequence composition measures the content of different parts of a genomic sequence; sequence correlation represents the relationships among different components of a genome. In this dissertation, a sequence correlation feature, ICO (Intrinsic Correlation of Oligonucleotides), is proposed based on the odds ratio from statistics and mutual information from information theory; it represents the correlation between two consecutive parts of an oligonucleotide. The ICO vector profiles of different genomes, and of fragments drawn from different genomes, indicate that ICO can not only distinguish microbial genomes but also accurately identify the origin of fragments. In addition, the ability of the entire ICO, the two parts of ICO, and 4-mer frequency to discriminate microbial genomes is investigated with a statistical method. The results demonstrate that each single part of ICO has its own advantage in distinguishing different genomes, and that the entire ICO outperforms both the single parts of ICO and 4-mer frequency. Finally, the ability of ICO to identify DNA fragments is evaluated based on inter-genomic and intra-genomic relational grades. The results indicate that ICO discriminates the origin of fragments more accurately than 4-mer frequency.
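The distinction between sequence composition and sequence correlation can be illustrated with a small sketch. The abstract does not give the ICO formula, so the code below is not the ICO definition; it only shows, under stated assumptions, a k-mer frequency vector (composition) and a generic log odds ratio between the two consecutive parts of each k-mer (correlation, in the style of pointwise mutual information). The function names `kmer_frequencies` and `split_log_odds`, the `split` parameter and the toy sequence are all hypothetical.

```python
from collections import Counter
from math import log2


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer made only of A/C/G/T."""
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet  # skip windows containing N etc.
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def split_log_odds(seq, k, split):
    """Assumed correlation measure, NOT the dissertation's ICO formula.

    For each observed k-mer, returns log2 of its frequency divided by the
    product of the frequencies of its first `split` bases and of its
    remaining `k - split` bases. Positive values mean the two consecutive
    parts co-occur more often than independence would predict.
    """
    whole = kmer_frequencies(seq, k)
    left = kmer_frequencies(seq, split)
    right = kmer_frequencies(seq, k - split)
    return {
        kmer: log2(freq / (left[kmer[:split]] * right[kmer[split:]]))
        for kmer, freq in whole.items()
    }


# Toy usage on a short, made-up sequence.
toy = "ACGTACGTACGT"
composition = kmer_frequencies(toy, 4)          # 4-mer frequency profile
correlation = split_log_odds(toy, 4, split=2)   # 2-mer | 2-mer split
print(sorted(composition.items()))
print(sorted(correlation.items()))
```

A hybrid feature of the kind mentioned for binning could, in this sketch, be formed by concatenating the two profiles over a fixed k-mer ordering; how the dissertation actually combines them is not stated in the abstract.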
In conclusion, ICO performs well in distinguishing microbial genomes and identifying DNA fragments by extracting in-depth information from the genome.
2. A metagenome is a collection of fragments sequenced directly from an environmental sample and contains a large number of DNA sequences of different origins. Metagenomic binning algorithms classify these sequences according to their origins. In this study, we propose HSS-bin, an unsupervised metagenomic binning algorithm that uses a hybrid feature combining the ICO of 4-mers with 4-mer frequency and employs the spectral clustering algorithm. The binning performance of HSS-bin is evaluated on several simulated metagenomic datasets and one real metagenomic dataset. Experimental results demonstrate that HSS-bin performs better than single-feature-based algorithms on metagenomic datasets with short sequences, uniform species abundance and multiple species, compensating for the weakness of single-feature-based binning. In addition, the spectral clustering algorithm clearly improves binning performance. HSS-bin also outperforms the widely used unsupervised binning algorithms MetaCluster and LikelyBin: on the real metagenomic dataset, its binning accuracy exceeds that of MetaCluster and LikelyBin by 38.1% and 31.18%, respectively. Both the combination of sequence composition with sequence correlation and the application of spectral clustering therefore provide new insights for designing metagenomic binning algorithms.
3. Metagenomics has entered the era of big data, which opens new directions for metagenomic research. The diversity within a single metagenome is no longer the only focus; more investigations now concentrate on the diversity among metagenomes. In this dissertation, we propose DectICO, an alignment-free supervised metagenomic sample classification algorithm.
It employs ICO based on long oligonucleotides together with a dynamic KPLS (Kernel Partial Least Squares) feature selection algorithm, and performs classification with the SVM algorithm. Three real metagenomic sequencing datasets with different sequencing depths are classified to evaluate the performance of DectICO. Experimental results indicate that DectICO significantly outperforms the sequence-composition-based method built on long oligonucleotides, and that this advantage becomes more pronounced as the oligonucleotides get longer. In addition, dynamically selecting ICO features with the KPLS algorithm significantly improves classification performance. Finally, DectICO is compared with the RSVM-based (Recursive Support Vector Machine) classification algorithm. The results demonstrate that, with a set of completely labeled samples as the training set, DectICO classifies metagenomic samples more accurately than the RSVM-based method on datasets with both low and high sequencing depth. Our method is also more stable and more general than the RSVM-based method. In summary, the proposed algorithm can accurately classify metagenomic samples of different status, providing a theoretical basis for research on detecting disease phenotypes in clinical samples, forensic (medicolegal) examination, environmental pollution and related areas.
Keywords/Search Tags: microbe, metagenome, sequence feature, machine learning, binning algorithm, sample classification algorithm