Some Clustering and Classification Problems in High-Throughput Metagenomics and Cheminformatics

Posted on:2014-02-15

Degree:Ph.D

Type:Dissertation

University:University of California, Riverside

Candidate:Tanaseichuk, Olga

Full Text:PDF

GTID:1458390005997462

Subject:Computer Science

Abstract/Summary:

In this dissertation, we address three different problems in high-throughput metagenomics and cheminformatics.;(1) Metagenomics studies the genomic content of an entire microbial community by simultaneously sequencing all genomes in an environmental sample. The advent of next-generation sequencing (NGS) technologies has drastically reduced sequencing time and cost, leading to the generation of millions of sequences (reads) in a single run. An important problem in metagenomic analysis is to determine and quantify species (or genomes) in a metagenomic sample. The problem is challenging due to an unknown number of genomes and their abundance ratios, presence of repeats and sequencing errors, and the short length of NGS reads. We propose two algorithms to address these challenges. First, we present an algorithm for separating short paired-end reads from genomes with similar abundance levels. Second, we propose a method to accurately estimate the abundance levels of species. The algorithm automatically determines the number of abundance groups in a metagenomic dataset and bins the reads into these groups.;(2) NGS coupled with metagenomics has led to the rapid growth of sequence databases and enabled a new branch of microbiology called comparative metagenomics. It is a fast growing field that requires the development of novel supervised learning techniques. In particular, the problem of microbial community classification may have useful applications enabling efficient organization and search in rapidly growing metagenomic databases, detection of disease phenotypes in clinical samples, and forensic identification. We propose a novel supervised classification method for metagenomic samples that takes advantage of the natural structure in microbial community data encoded by a phylogenetic tree.;(3) In modern drug discovery, ultra-high-throughput screening is applied to millions of drug-like compounds in one experiment. Hierarchical clustering is an important step in the drug discovery process. Standard implementations of the exact algorithm for hierarchical clustering require O( n2) time and O(n 2) memory. Even though approximate hierarchical clustering methods overcome this problem, they either rely on embedding into spaces that are not biologically sensible, or produce very low resolution hierarchical structures. We present a hybrid hierarchical clustering algorithm requiring approximately O(n n ) time and O(n n ) memory while still preserving the most desirable properties of the exact algorithm.

Keywords/Search Tags:

Metagenomics, Problem, Clustering, Algorithm, Classification

Related items

1	Study On Clustering Problem Based On Improved Cuckoo Algorithm
2	Research On Algorithm Of Clustering Deformation Problem
3	Clustering Algorithm And Application Based On Combinatorial Optimization
4	Big data biology-based predictive models for metagenomics binning
5	An Improved Algorithm For Inverse Problem Of Svms Based On Clustering
6	An Improved Algorithm For Inverse Problem Of SVMs Based On Clustering
7	Clustering, classification and segmentation of three-dimensional images
8	Computational genomic signatures and metagenomics
9	Theory And Applications Of Ant Colony Algorithm And Hybridizing Other Algorithms
10	A Clustering-based Algorithm For Vehicle Routing Problem