Efficient algorithms for large data sets of genomic sequences in microbial community analysis

Posted on:2011-05-31

Degree:M.S.

Type:Thesis

University:University of Colorado at Boulder

Candidate:Knox, David A

GTID:2448390002457975

Subject:Biology

Abstract/Summary:

Microbial analysis of environmental samples uses high-throughput genomic sequencing to determine the diversity and quantity of microbial species. Current sequencing techniques can produce data sets too large for existing analysis applications to handle, necessitating the design of better approaches. This work presents three new applications: SeqCluster, ParsInsert, and PTreeView. SeqCluster groups sequences by similarity using a hierarchical clustering method and selects a representative sequence for each group to create operational taxonomic units (OTUs). SeqCluster also supports large distance matrices that exceed available local memory by using a custom memory management system. ParsInsert introduces an algorithm that exploits the knowledge provided by publicly available curated phylogenetic trees to efficiently produce both a phylogenetic tree and taxonomies for unknown sequences. PTreeView is a user-friendly visualization application with a broad range of functions and capabilities supporting very large trees. The applications presented here handle hundreds of thousands of sequences efficiently for data clustering, phylogenetic tree building, taxonomic classification, and tree visualization.
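
The general idea of grouping sequences into OTUs by hierarchical clustering and picking a representative for each group can be illustrated with a small sketch. This is not SeqCluster's implementation: the function names `distance` and `cluster_otus`, the mismatch-fraction distance, the complete-linkage rule, the 0.03 default threshold, and the medoid choice of representative are all assumptions made for illustration. The sketch also keeps every pairwise distance in memory, so it does not model the custom memory management the abstract describes.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of mismatched positions between two aligned, equal-length sequences.

    Assumption: sequences are already aligned; real pipelines may use other measures.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_otus(seqs, threshold=0.03):
    """Complete-linkage agglomerative clustering (hypothetical helper).

    Repeatedly merges the two closest clusters until the closest pair is
    farther apart than `threshold`. Returns a list of
    (representative_index, sorted_member_indices), one entry per OTU.
    """
    n = len(seqs)
    # Full pairwise distance table, held in memory (fine only for small inputs).
    dist = {(i, j): distance(seqs[i], seqs[j]) for i, j in combinations(range(n), 2)}

    def d(i, j):
        return 0.0 if i == j else dist[(min(i, j), max(i, j))]

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for p, q in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is the largest member-to-member distance.
            link = max(d(i, j) for i in clusters[p] for j in clusters[q])
            if best is None or link < best[0]:
                best = (link, p, q)
        if best[0] > threshold:
            break
        _, p, q = best  # p < q, so deleting q leaves index p valid
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    otus = []
    for members in clusters:
        # Representative = medoid: the member with the smallest total distance to the rest.
        rep = min(members, key=lambda i: sum(d(i, j) for j in members))
        otus.append((rep, sorted(members)))
    return otus


if __name__ == "__main__":
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG"]
    for rep, members in cluster_otus(reads, threshold=0.15):
        print(f"OTU representative {reads[rep]} covers sequences {members}")
```

The loop recomputes linkage distances on every pass, which is far too slow for hundreds of thousands of sequences; it is meant only to make the clustering and representative-selection steps concrete.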

Keywords/Search Tags:

Sequences, Data, Large

Related items

1	Convergence Properties Of Mixing Random Sequences
2	The Study Of Sequences With Low(ODD) Even Correlation
3	Depth Analysis, Data Mining Web Access Logs
4	Analysis On Construction And Randomness Of Pseudo-random Sequences
5	Research On Some Properties Of Primitive σ-LFSR Sequences
6	Reconstructing Truncated Sequences Derived From Primitive Sequences Over Integer Residue Rings
7	Correlation Research On Pseudo-Random Sequences
8	Construction And Properties Of GMV Sequences And Component Sequences
9	Construction And Analysis On Two Pseudo-Random Sequences
10	Sequences Design For Wireless Communication System