
Exploration Of Dimensionality Reduction For High-dimensional Data Visualization And Its Application In Biomedicine

Posted on: 2017-01-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W W Xu
GTID: 1318330485462084
Subject: Digital media
Abstract/Summary:
The rapid development and proliferation of information technology, and its application in the life sciences, have yielded massive high-dimensional datasets that researchers must interrogate in a timely and efficient manner. These include not only numerical data but also data encoded in text, images, and multimedia. The dimensionality of raw datasets poses numerous challenges, from data pre-processing to subsequent visualization. Since visualization tools are confined to three-dimensional space and a limited number of visualization heuristics, it is a challenge to depict every facet of a dataset in which relationships exist. As such, the union of dimensionality reduction and information visualization technologies is a vital area of active research with the goal of overcoming the "curse of dimensionality". In order to visualize complex data so that it can be easily understood, dimensionality reduction techniques must first employ either linear or nonlinear approaches to project the data to a lower-dimensional space. By accurately identifying the most salient themes, useful information can be extracted from complex data and explored by researchers.
Conventional dimensionality reduction techniques and their abilities to visualize data have been compared previously, showing that each method has its own advantages and disadvantages when dealing with real-world datasets of varying complexity and diversity. Indeed, not every approach is well suited to every type of data.
From a visualization perspective, these approaches raise three main concerns: (1) projecting data onto a lower-dimensional space involves a compromise between a proper spatial distribution and preserving the local characteristics of the original data; (2) projecting high-dimensional data onto a two-dimensional space discards information encoded in the multiple facets through which observations may be related; (3) most visualization algorithms for dimensionality reduction rely on pairwise distance metrics to measure similarity, which has a notable impact on processing efficiency.
To address these three concerns, this dissertation first empirically tests nonlinear dimensionality reduction techniques from manifold learning. In doing so, it is assumed that data points are evenly distributed on a low-dimensional manifold that is embedded in a high-dimensional space. Three types of dimensionality reduction techniques are compared on structured geometric modeling datasets. Additionally, the properties of different high-dimensional biomedical datasets projected to lower-dimensional spaces are evaluated, based on how easily a user can explore data relationships and patterns without an intimate understanding of the underlying data.
The content of this research is structured as follows:
(1) We propose a neighborhood embedding algorithm with Laplace regularization (LA2SNE). This algorithm uses the Laplace distribution to compute probability distributions over pairwise distances in both the high-dimensional and the low-dimensional space, in order to reduce overlap within the two-dimensional space. Symmetric Kullback-Leibler (KL) divergence is then minimized between the high-dimensional and low-dimensional distributions, so that the projection better preserves the structure of the data in its original high-dimensional space.
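
The abstract does not state the formulas behind LA2SNE. As a rough sketch only, assuming that a Laplace kernel takes the place of the Gaussian kernel used in SNE-style methods, the similarities and the symmetric KL objective described above could be written as follows. The symbols $x_i$ (high-dimensional points), $y_i$ (their two-dimensional images), and the scale parameters $b$ and $b'$ are illustrative and do not come from the source.

```latex
p_{ij} = \frac{\exp\left(-\lVert x_i - x_j \rVert / b\right)}
              {\sum_{k \neq l} \exp\left(-\lVert x_k - x_l \rVert / b\right)},
\qquad
q_{ij} = \frac{\exp\left(-\lVert y_i - y_j \rVert / b'\right)}
              {\sum_{k \neq l} \exp\left(-\lVert y_k - y_l \rVert / b'\right)}

C_{\mathrm{sym}}
  = \mathrm{KL}(P \,\Vert\, Q) + \mathrm{KL}(Q \,\Vert\, P)
  = \sum_{i \neq j} \left(p_{ij} - q_{ij}\right) \log \frac{p_{ij}}{q_{ij}}
```

The last equality holds for any two distributions over the same index set, because the two KL terms share the same logarithm with opposite signs. How the dissertation actually normalizes the similarities or chooses the scales is not stated in the abstract.
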
Subsequently, by constructing a Laplacian matrix in the high-dimensional space as the regularization term and adjusting a penalty coefficient, the internal structure of the projected visualization is distributed more coherently, making cluster boundaries more obvious. These methods were applied to the Swiss roll dataset as well as to human microbiome data, and the resulting two-dimensional projections were assessed both quantitatively and qualitatively. The results suggest that LA2SNE improves the visual representation of these high-dimensional datasets for user exploration, by maintaining the overall structure of the data and by making data clusters more apparent.
(2) A visualization method (L-mm t-SNE) is presented, which extends the conventional single two-dimensional map projection to multiple maps with manifold regularization. This method addresses the issue of how multiple data facets co-occur with each other, specifically in "disease-to-phenotype" datasets. L-mm t-SNE introduces a manifold point regularization term, which pulls locally similar points together so that they are not scattered across different maps. Experiments demonstrate that L-mm t-SNE requires fewer two-dimensional maps for datasets with "co-occurrence" features, while still providing adequate explanations of the multiple facets that exist.
(3) A computationally efficient approach to fast dimensionality reduction for visualization purposes is presented. This is accomplished by first applying augmented non-negative matrix factorization (ANMF) to the original matrix as a pre-processing step. A vantage point tree (vp-tree) is then used to search for an "optimum" set of similar neighboring points, and the similarities between those neighbors are calculated based on a probability distance. Lastly, KL divergence is applied to the projected data space.
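
The neighbor-search step mentioned above can be illustrated with a minimal vantage point tree. This is a sketch under stated assumptions, not the dissertation's implementation: it uses only the Python standard library, assumes Euclidean distance, and the names `VPNode`, `build_vp_tree`, and `knn` are hypothetical. The ANMF pre-processing and the KL-divergence step are not shown.

```python
import heapq
import math
import random


class VPNode:
    """One node of a vantage point tree (hypothetical helper class)."""

    def __init__(self, point, index, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.index = index      # position of that point in the input data
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points closer than radius
        self.outside = outside  # subtree of points at radius or farther


def build_vp_tree(items, rng):
    """Build a tree from a list of (index, point) pairs.

    Assumes Euclidean distance via math.dist (Python 3.8+). Recursion
    depth can grow with the input size when many points are duplicates.
    """
    items = list(items)
    if not items:
        return None
    # Pick a random vantage point and remove it from the working list.
    pick = rng.randrange(len(items))
    items[pick], items[-1] = items[-1], items[pick]
    vp_index, vp = items.pop()
    if not items:
        return VPNode(vp, vp_index, 0.0, None, None)
    dists = [math.dist(vp, p) for _, p in items]
    radius = sorted(dists)[len(dists) // 2]  # median split
    inside = [it for it, d in zip(items, dists) if d < radius]
    outside = [it for it, d in zip(items, dists) if d >= radius]
    return VPNode(vp, vp_index, radius,
                  build_vp_tree(inside, rng), build_vp_tree(outside, rng))


def knn(root, query, k):
    """Return the k nearest (distance, index) pairs, closest first (k >= 1)."""
    heap = []  # max-heap on distance, stored as (-distance, index)

    def tau():
        # Current k-th best distance; infinite until k candidates are found.
        return -heap[0][0] if len(heap) == k else math.inf

    def visit(node):
        if node is None:
            return
        d = math.dist(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.index))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, node.index))
        # Search the more promising side first, then the other side only
        # if the triangle inequality cannot rule it out.
        if d < node.radius:
            visit(node.inside)
            if d + tau() >= node.radius:
                visit(node.outside)
        else:
            visit(node.outside)
            if d - tau() <= node.radius:
                visit(node.inside)

    visit(root)
    return sorted((-neg_d, i) for neg_d, i in heap)


if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(50)]
    tree = build_vp_tree(list(enumerate(points)), rng)
    for dist, idx in knn(tree, (0.5, 0.5), 3):
        print(idx, round(dist, 4))
```

The pruning test is what saves work compared with comparing every pair of points: a subtree is skipped whenever no point in it can be closer than the current k-th best candidate. The resulting neighbor distances would then be turned into similarities, which this sketch leaves out.
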
Compared to conventional visualization methods, the proposed method reduces the processing time for the high-dimensional microbiome dataset by more than 50%.
In conclusion, a novel dimensionality reduction and information visualization framework suited to interactive exploration of complex high-dimensional data has been presented. It addresses three key areas in this research space: properly spacing data points to form salient clusters while still maintaining the underlying nature of the raw data, providing an optimal number of maps that reflect the multi-faceted nature of the data, and improving the computational efficiency of the overall process.
Keywords/Search Tags:data visualization, bioinformatics, manifold learning, regularization, dimension reduction technique, the human microbiome project