Efficient algorithms for large data sets of genomic sequences in microbial community analysis

Posted on: 2011-05-31
Degree: M.S.
Type: Thesis
University: University of Colorado at Boulder
Candidate: Knox, David A
Full Text: PDF
GTID: 2448390002457975
Subject: Biology
Abstract/Summary:
Microbial analysis of environmental samples uses high-throughput genomic sequencing to determine the diversity and quantity of microbial species. Current sequencing techniques can produce data sets too large for existing analysis applications to handle, necessitating the design of better approaches. This work presents three new applications: SeqCluster, ParsInsert, and PTreeView. SeqCluster groups sequences by similarity using a hierarchical clustering method and selects a representative sequence from each group to create operational taxonomic units (OTUs). SeqCluster also supports large distance matrices that exceed the size of available local memory by using a custom memory management system. ParsInsert introduces an algorithm that exploits the knowledge provided by publicly available curated phylogenetic trees to efficiently produce both a phylogenetic tree and taxonomies for unknown sequences. PTreeView is a user-friendly visualization application with a broad range of functions and capabilities that supports very large trees. The applications presented here handle hundreds of thousands of sequences efficiently for data clustering, phylogenetic tree building, taxonomic classification, and tree visualization.
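To make the OTU step concrete, the following is a minimal Python sketch of hierarchical clustering followed by representative selection. It is not SeqCluster's implementation: the distance measure (fraction of mismatched positions in pre-aligned sequences), the average-linkage rule, the `threshold` parameter, and the choice of the medoid as representative are all assumptions made for illustration, and the function names are hypothetical. It also holds every pairwise distance in memory, which is exactly the limitation the custom memory management described above is meant to remove.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_otus(seqs, threshold):
    """Group sequences by average-linkage agglomerative clustering.

    Merging stops once the closest pair of clusters is farther apart than
    `threshold` (an assumed cutoff, not a value taken from the thesis).
    Returns a list of (representative_sequence, member_indices) pairs.
    """
    n = len(seqs)
    # Full pairwise distance table, kept in memory for simplicity.
    dist = {
        (i, j): mismatch_fraction(seqs[i], seqs[j])
        for i, j in combinations(range(n), 2)
    }

    def d(i, j):
        return 0.0 if i == j else dist[(min(i, j), max(i, j))]

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(d(i, j) for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[a], clusters[b]) > threshold:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)

    otus = []
    for members in clusters:
        # Assumed representative: the medoid, i.e. the member with the
        # smallest total distance to the other members of its cluster.
        rep = min(members, key=lambda i: sum(d(i, j) for j in members))
        otus.append((seqs[rep], sorted(members)))
    return otus


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
    for representative, members in cluster_otus(reads, threshold=0.15):
        print(representative, members)
```

This naive version recomputes linkages on every merge and scales poorly, so it is only suitable for small inputs; it is meant to show the shape of the computation, not a method for hundreds of thousands of sequences.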
Keywords/Search Tags:Sequences, Data, Large