Scalable and robust clustering and visualization for large-scale bioinformatics data

Posted on: 2015-01-17
Degree: Ph.D
Type: Dissertation
University: Indiana University
Candidate: Ruan, Yang
Full Text: PDF
GTID: 1478390017992069
Subject: Computer Science
Abstract/Summary:
During the past few decades, advances in next-generation sequencing (NGS) techniques have enabled rapid analysis of all the genetic information within a microbial community, bypassing the culturing of individual microbial species in the lab. These techniques have led to a proliferation of raw genomic data, which creates an unprecedented opportunity for data mining. To analyze this voluminous bioinformatics data, a pipeline called DACIDR has been proposed. DACIDR adopts a taxonomy-independent approach to grouping these sequences into operational taxonomic units (OTUs), referred to as data clustering, and it enables visualization of the clustering result by applying parallelization and multidimensional scaling (MDS) techniques on large-scale computational resources. First, in order to observe the proximity of the sequences in a lower dimension, sequence alignment techniques are applied to each pair of sequences to generate similarity scores in a high dimension. These scores must be assigned weights in order to achieve an accurate result in MDS. Therefore, a robust and scalable MDS algorithm called WDA-SMACOF is proposed to handle both missing distances and non-trivial weight functions. Second, a dataset with millions of sequences is usually divided into two parts: the first is processed with MDS, which has quadratic space and time complexity, while the second is mapped into the same space by approximation, a step referred to as interpolation, which has linear time complexity. In order to achieve real-time processing speed, a novel hierarchical approach has been proposed to further reduce the time complexity of interpolation to sub-linear. Third, a phylogenetic tree is commonly used to demonstrate the phylogeny and evolutionary path of various organisms. The traditional way of visualizing a phylogenetic tree preserves only the correlations between ancestors and their descendants. By utilizing MDS and interpolation, an algorithm called interpolative joining has been proposed to display the tree on top of the clustering, where their correlations can be intuitively observed in a 3D tree diagram called a Spherical Phylogram. The optimizations in these three steps greatly reduce the time complexity of visualizing sequence clustering while increasing its accuracy.
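The role of weights in MDS when some pairwise distances are missing can be illustrated with a minimal sketch. This is not the WDA-SMACOF algorithm from the dissertation: it is plain weighted SMACOF (iterative majorization with a pseudo-inverse), written under the assumption that a missing distance is represented by a weight of zero. The function names `weighted_stress` and `weighted_smacof` and all parameters are hypothetical, chosen for illustration only.

```python
import numpy as np


def weighted_stress(X, D, W):
    """Weighted stress: sum over pairs i<j of W_ij * (d_ij(X) - D_ij)^2."""
    D = np.where(W > 0, D, 0.0)  # ignore entries with zero weight (missing)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(len(X), k=1)
    return float((W[iu] * (dist[iu] - D[iu]) ** 2).sum())


def weighted_smacof(D, W, dim=3, iters=200, seed=0):
    """Plain weighted SMACOF (illustrative assumption, not WDA-SMACOF).

    D: symmetric (n, n) target distances; entries with W == 0 are ignored.
    W: symmetric (n, n) non-negative weights; 0 marks a missing distance.
    """
    n = D.shape[0]
    D = np.where(W > 0, D, 0.0)
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))  # random initial configuration

    # V is the weighted Laplacian of W; it is singular, so use a pseudo-inverse.
    V = -W.astype(float)
    np.fill_diagonal(V, 0.0)
    np.fill_diagonal(V, -V.sum(axis=1))
    V_pinv = np.linalg.pinv(V)

    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(dist > 0, W * D / dist, 0.0)
        # B(X): off-diagonal entries are -ratio, diagonal makes rows sum to 0.
        B = -ratio
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        X = V_pinv @ B @ X  # Guttman transform with weights
    return X
```

Each iteration of this sketch cannot increase the weighted stress, and pairs with zero weight do not influence the layout. The pseudo-inverse costs cubic time in the number of points, so the sketch is suitable only for small inputs and says nothing about how a scalable method would avoid that cost.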
Keywords/Search Tags: Clustering, Time complexity, Data, MDS, Techniques