Big data biology-based predictive models for metagenomics binning

Posted on: 2016-06-28
Degree: Ph.D
Type: Dissertation
University: University of Massachusetts Lowell
Candidate: Saghir, Helal
Full Text: PDF
GTID: 1478390017484682
Subject: Computer Engineering
Abstract/Summary:
Metagenomics is the study of microorganisms collected directly from natural environments using whole genome shotgun (WGS) sequencing. Metagenomic methods allow relatively rapid sequencing of the genomes of organisms that cannot be cultured in a laboratory. The field is strongly affected by the generation of "big data" sets, as current DNA sequencing technologies are capable of generating data faster and at a lower cost. If any field is suited to big data, it is metagenomics, where hundreds of terabytes of data are sometimes generated from analysis. Assigning the random fragments obtained from whole genome shotgun data to groups is called binning. Currently, there are two different methods of binning: (a) sequence similarity methods and (b) sequence composition methods. Sequence similarity methods are usually based on sequence alignment to known reference genomes using tools such as BLAST and MEGAN. As only a very small fraction of species are known and available in current databases, similarity methods do not yield good results. Additionally, as a given database of organisms grows, the complexity of the database search also grows. Sequence composition methods are based on compositional features built from DNA sub-sequences, k-mers, or other genomic signatures; examples include TETRA, PhyloPythia, CompostBin, and LikelyBin. One of the main limitations of sequence composition methods is handling the very large number of features that sub-sequence (k-mer) composition produces.
In this dissertation, we propose five different predictive models to solve the problem of sequence binning in more accurate and efficient ways. In particular, we analyze the effect of using different sets of proposed features, as well as feature reduction methods, on sequence classification accuracy. We also analyze the effect of selecting different predictive classifier models on binning prediction accuracy.
We analyze and compare results obtained using k-mers, codons, and amino acid sub-sequences derived from conserved protein domain blocks of determined sizes from various organisms. The main idea behind using amino acid block sub-sequences is that they are more biology-based than k-mers or codons when used as features in the proposed predictive models. We show, for the first time, that with the data considered in this work, amino acid sub-sequences derived from conserved protein domains give better prediction accuracy than k-mer or codon frequencies. We present a comparative analysis of binning predictive models using PCA, a statistical t-test, a Naive Bayes classifier, and a Random Forest classifier. Our analysis shows that using the Random Forest classifier across the proposed feature selections results in better prediction accuracy than the Naive Bayes classifier. Additionally, we show that using the actual frequencies of features, rather than only their presence or absence, results in more accurate sequence classification and prediction.
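To make the feature terminology concrete, the following is a minimal Python sketch of k-mer composition features, not the dissertation's implementation. The function name `kmer_features` and its parameters are hypothetical, and skipping windows that contain characters other than A, C, G, T is an assumption. It shows both why the feature count grows quickly (there are 4^k possible DNA k-mers) and the difference between frequency features and presence/absence features.

```python
from collections import Counter
from itertools import product


def kmer_features(seq, k=4, binary=False):
    """Return a fixed-length feature vector over all 4**k DNA k-mers.

    Hypothetical helper for illustration only.
    binary=False: relative frequency of each k-mer in the fragment.
    binary=True:  1.0 if the k-mer occurs at least once, else 0.0.
    """
    seq = seq.upper()
    # Fixed vocabulary in lexicographic order: AAAA, AAAC, ..., TTTT.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if binary:
        return [1.0 if counts[m] > 0 else 0.0 for m in vocab]
    # Assumption: windows with non-ACGT characters (e.g. N) are ignored,
    # so only vocabulary k-mers contribute to the normalizing total.
    total = sum(counts[m] for m in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[m] / total for m in vocab]


if __name__ == "__main__":
    fragment = "ACGTACGTTAGC"
    freq = kmer_features(fragment, k=2)
    present = kmer_features(fragment, k=2, binary=True)
    print(len(freq), "features for k=2")
    print(len(kmer_features(fragment, k=4)), "features for k=4")
    print(int(sum(present)), "distinct 2-mers observed")
```

Either vector could then be passed to a classifier such as Naive Bayes or Random Forest; the frequency form keeps how often each k-mer occurs, which the presence/absence form discards.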
Keywords/Search Tags: Using, Predictive models, Metagenomics, Data, Methods, Sequence, Binning, Features