Big data biology-based predictive models for metagenomics binning

Posted on:2016-06-28

Degree:Ph.D

Type:Dissertation

University:University of Massachusetts Lowell

Candidate:Saghir, Helal

GTID:1478390017484682

Subject:Computer Engineering

Abstract/Summary:

Metagenomics is the study of microorganisms collected directly from natural environments using whole genome shotgun (WGS) sequencing. Metagenomic methods allow relatively rapid sequencing of the genomes of organisms that cannot be cultured in a laboratory. The field is strongly affected by the generation of "big data" sets, as current DNA sequencing technologies generate data faster and at lower cost. If any field is suited to big data, it is metagenomics, where a single analysis can sometimes generate hundreds of terabytes of data. Assigning the random fragments obtained from whole genome shotgun data to groups is called binning. Currently, there are two approaches to binning: (a) sequence similarity methods and (b) sequence composition methods. Sequence similarity methods are usually based on aligning sequences to known reference genomes using tools such as BLAST and MEGAN. Because only a very small fraction of species are known and available in current databases, similarity methods do not yield good results. Additionally, as a database of organisms grows, the complexity of searching it also grows. Sequence composition methods are based on compositional features made of DNA sub-sequences, k-mers, or other genomic signatures, and include tools such as TETRA, PhyloPythia, CompostBin, and LikelyBin. One of the main limitations of sequence composition methods is handling the very large number of features that results from sub-sequence (k-mer) composition.

In this dissertation, we propose five different predictive models to solve the problem of sequence binning more accurately and efficiently. In particular, we analyze the effect of different sets of proposed features, as well as feature reduction methods, on sequence classification accuracy. We also analyze the effect of selecting different predictive classifier models on binning prediction accuracy.
We analyze and compare results obtained using k-mers, codons, and amino acid sub-sequences derived from conserved protein domain blocks of determined sizes from various organisms. The main idea behind using amino acid block sub-sequences is that they are more biology-based than k-mers or codons when used as features in the proposed predictive models. We show, for the first time, that with the data considered in this work, amino acid sub-sequences derived from conserved protein domains give better prediction accuracy than k-mer or codon frequencies. We present a comparative analysis of binning predictive models using PCA, a statistical t-test, a Naive Bayes classifier, and a Random Forest classifier. Our analysis shows that the Random Forest classifier, combined with the various proposed feature selections, gives better prediction accuracy than the Naive Bayes classifier. Additionally, we show that using the actual frequencies of features, rather than only their presence or absence, results in more accurate sequence classification and prediction.
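
The distinction between frequency features and presence/absence features can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names (`kmer_counts`, `feature_vectors`), the choice of k = 2, the skipping of windows containing ambiguous bases, and the example fragment are all assumptions made for illustration. The sketch builds two feature vectors for one DNA fragment over all 4^k possible k-mers: one holding relative frequencies and one holding 0/1 presence flags.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts


def feature_vectors(seq, k=2):
    """Return (frequencies, presence) over all 4**k k-mers.

    Both vectors follow the lexicographic k-mer order AA, AC, AG, ...
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Relative frequency of each k-mer among the valid windows.
    freqs = [counts[kmer] / total if total else 0.0 for kmer in vocabulary]
    # Binary flag: 1 if the k-mer occurs at least once, else 0.
    presence = [1 if counts[kmer] > 0 else 0 for kmer in vocabulary]
    return freqs, presence


if __name__ == "__main__":
    fragment = "ACGTACGTNACG"  # made-up fragment, for illustration only
    freqs, presence = feature_vectors(fragment, k=2)
    print("number of features:", len(freqs))
    print("distinct 2-mers present:", sum(presence))
```

The feature count grows as 4^k, which is the dimensionality problem the abstract attributes to composition methods. The presence vector discards how often each k-mer occurs, whereas the frequency vector keeps that information for a downstream classifier.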

Keywords/Search Tags:

Using, Predictive models, Metagenomics, Data, Methods, Sequence, Binning, Features

Related items

1	Neural network based movement models to improve the predictive utility of entity state synchronization methods for distributed simulations
2	Aging Predictive Models and Simulation Methods for Analog and Mixed-Signal Circuits
3	Using Random Projection Technology To Find Biological Sequence Features Of The Algorithm
4	Some Clustering and Classification Problems in High-Throughput Metagenomics and Cheminformatics
5	Research On Metagenomic Sequence Binning Algorithm Based On Feature Vectors
6	Inference, orthology, and inundation: Addressing current challenges in the field of metagenomics
7	Learning predictive models from massive, semantically disparate data
8	GPR methods for the detection and characterization of fractures and karst features: Polarimetry, attribute extraction, inverse modeling and data mining techniques
9	Use of macroinvertebrate predictive models to evaluate the stream restoration effect
10	Research On Predictive Query Over Data Streams