Scalable and robust clustering and visualization for large-scale bioinformatics data

Posted on:2015-01-17

Degree:Ph.D

Type:Dissertation

University:Indiana University

Candidate:Ruan, Yang

GTID:1478390017992069

Subject:Computer Science

Abstract/Summary:

During the past few decades, advances in next-generation sequencing (NGS) techniques have enabled rapid analysis of all the genetic information within a microbial community, bypassing the culturing of individual microbial species in the lab. These techniques have led to a proliferation of raw genomic data, creating an unprecedented opportunity for data mining. To analyze this large volume of bioinformatics data, a pipeline called DACIDR has been proposed. DACIDR adopts a taxonomy-independent approach to grouping these sequences into operational taxonomic units (OTUs), referred to as data clustering, and it enables visualization of the clustering result by applying parallelization and multidimensional scaling (MDS) techniques on large-scale computational resources. First, in order to observe the proximity of the sequences in a lower dimension, sequence alignment techniques are applied to each pair of sequences to generate similarity scores in a high dimension. These scores need to be assigned weights in order to achieve an accurate result in MDS. Therefore, a robust and scalable MDS algorithm called WDA-SMACOF is proposed to handle cases with missing distances or a non-trivial weight function. Second, a dataset with millions of sequences is usually divided into two parts: the first is processed with MDS, which has quadratic space and time complexity, while the second is mapped into the same space by approximation in linear time, a step referred to as interpolation. In order to achieve real-time processing speed, a novel hierarchical approach has been proposed to further reduce the time complexity of interpolation to sub-linear. Third, a phylogenetic tree is commonly used to demonstrate the phylogeny and evolutionary path of various organisms. The traditional way of visualizing a phylogenetic tree preserves only the correlations between ancestors and their descendants. By utilizing MDS and interpolation, an algorithm called interpolative joining has been proposed to display the tree on top of the clustering, where their correlations can be intuitively observed in a 3D tree diagram called a Spherical Phylogram. The optimizations in these three steps greatly reduce the time complexity of visualizing sequence clustering while increasing its accuracy.
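
To make the role of weights in MDS concrete, the following is a minimal sketch of the textbook weighted SMACOF iteration in Python with NumPy. It is not the WDA-SMACOF algorithm from the dissertation and omits whatever that algorithm adds for robustness and scalability; it only illustrates how a weight of zero lets a missing distance drop out of the stress function. The function names (`pairwise_distances`, `weighted_stress`, `weighted_smacof`) and all parameters are hypothetical, chosen for illustration, and the dense pseudo-inverse used here would not scale to the dataset sizes the abstract describes.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all rows of X, as an (n, n) matrix."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def weighted_stress(X, delta, W):
    """Weighted stress: sum over i<j of w_ij * (d_ij(X) - delta_ij)^2."""
    D = pairwise_distances(X)
    iu = np.triu_indices(len(X), k=1)
    return float((W[iu] * (D[iu] - delta[iu]) ** 2).sum())

def weighted_smacof(delta, W, dim=3, n_iter=200, tol=1e-9, seed=0):
    """Textbook weighted SMACOF (illustrative only, not WDA-SMACOF).

    delta: (n, n) symmetric target dissimilarities.
    W:     (n, n) symmetric non-negative weights; 0 marks a missing distance.
    Returns the embedding and the stress value after each iteration.
    """
    n = delta.shape[0]
    W = W.astype(float).copy()
    np.fill_diagonal(W, 0.0)

    # V has -w_ij off the diagonal and the row sums of W on the diagonal.
    V = -W
    np.fill_diagonal(V, W.sum(axis=1))
    V_pinv = np.linalg.pinv(V)  # V is singular, so use the pseudo-inverse

    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))
    history = [weighted_stress(X, delta, W)]

    for _ in range(n_iter):
        D = pairwise_distances(X)
        # delta_ij / d_ij, defined as 0 wherever the current distance is 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(D > 0, delta / D, 0.0)
        B = -W * ratio
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))

        # Guttman transform: minimizes the majorizing function of the stress.
        X = V_pinv @ (B @ X)
        history.append(weighted_stress(X, delta, W))
        if history[-2] - history[-1] < tol:
            break
    return X, history
```

A caller would pass a dissimilarity matrix derived from pairwise alignment scores as `delta` and set `W[i, j] = 0` for any pair whose distance is unavailable. Because each update minimizes a majorizing function, the recorded stress values should not increase from one iteration to the next.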

Keywords/Search Tags:

Clustering, Time complexity, Data, MDS, Techniques

Related items

1	Research On Complexity-oriented Spatial Data Clustering Analysis Methods
2	Study On Clustering For Large Data Sets And Its Applications
3	Study Of Kolmogorov Complexity Based Clustering Algorithms
4	Parallel Design And Implementation Of AP Clustering Algorithms Based On CUDA
5	The Effects Of Data Imbalance On The Performance Of Data Complexity Measures
6	Evolution-based Clustering Algorithms For Time-Series Data And Their Applications
7	Multiple Visual Features Based Image Complexity Evaluation And Applications
8	Clustering techniques for data mining and protein design around the concept of locality
9	An Improved Algorithm For Inverse Problem Of SVMs Based On Clustering