
Exploration of High-Dimensional Data: Theory and Practice of Organizing and Analyzing Data in High-Dimensional Spaces

Posted on: 2015-06-06
Degree: Ph.D
Type: Thesis
University: University of Massachusetts Lowell
Candidate: Lu, Jian
Full Text: PDF
GTID: 2478390017993583
Subject: Computer Science
Abstract/Summary:
Information helps people every day to make better decisions, work more productively, and live fuller lives. Today, with the increasing variety and quantity of information, the term "big data" draws a great deal of attention, yet few people understand the story behind it. For decades, companies have been collecting, analyzing, and storing data to support decision making. Due to hardware limitations (in both storage and computing), data were collected under an "Analyze First, Store Second" approach: the collected data were first transformed into potentially useful, well-structured data, called critical data. The critical data were then stored in traditional databases (i.e., relational databases) and mined for useful information. Structured critical data save storage and computing resources, but may lose valuable information. Recently, this approach has been abandoned because of the rapid growth of storage and computing power. In the rush for information, companies collect as much data as possible without any prior analysis, and massive data are stored directly in non-traditional databases. This new strategy of collecting data is referred to as "Store First, Analyze Second".

In the big data era, data are scaling rapidly and are often not well structured. In the past we thought of data as observations of a few features (e.g., height, geographic coordinates, RGB color), so it was easy to transform and store raw data as critical data. The trend today is toward observations with a large number of features. Following the "Store First" strategy, data are stored in databases whether they are valuable or not, so it is now common for a single observation to have hundreds of dimensions or more, as in financial tick-by-tick data, spectral data, sensor data, DNA microarray data, etc. We call data with a large number of dimensions (features, fields, attributes, or columns) high-dimensional data, and we can say confidently that processing high-dimensional data will be very significant in the age of big data.

This thesis addresses the challenges in the study of high-dimensional data, including examining the curse of dimensionality, organizing high-dimensional data in traditional databases using indexing techniques, and practical applications of searching and analyzing high-dimensional data. When the dimensionality becomes high (i.e., larger than 3), many methods that are efficient in lower dimensions become inefficient in high-dimensional spaces, and phenomena arise in processing high-dimensional data that do not occur in lower-dimensional settings, the so-called "curse of dimensionality". The general phenomenon is that, as the dimensionality increases, the hyper-volume of the space grows so fast that the available data in the space become sparse. In the domain of databases, this refers to the intractability of indexing and searching through a high-dimensional space. In the domain of data mining, the problem is that, with a fixed number of training samples, predictive power declines as the dimensionality increases. In this dissertation, we provide several theoretical and practical methodologies for organizing and analyzing high-dimensional data.
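As an illustration of this sparsity (not part of the original thesis; a minimal sketch assuming only the numpy library), the following Python snippet draws a fixed number of points uniformly from the unit hypercube and shows that, as the dimension grows, almost no points fall inside a fixed-radius ball and pairwise distances concentrate, which is exactly the behavior that defeats low-dimensional indexing and nearest-neighbor reasoning:

import numpy as np

rng = np.random.default_rng(0)
n = 1000  # fixed sample size, as in the fixed-number-of-training-samples argument above

for d in (2, 10, 100, 1000):
    x = rng.random((n, d))                          # n points uniform in the hypercube [0, 1]^d
    dists = np.linalg.norm(x - x[0], axis=1)[1:]    # distances from the first point to all others
    spread = (dists.max() - dists.min()) / dists.min()          # shrinks as d grows (distance concentration)
    inside = np.mean(np.linalg.norm(x - 0.5, axis=1) < 0.5)     # fraction inside a radius-0.5 ball at the center
    print(f"d={d:4d}  relative distance spread={spread:.3f}  fraction inside central ball={inside:.4f}")

Under these assumptions, the fraction of points inside the central ball collapses toward zero and the relative spread of distances shrinks as d increases, which is the sense in which data become sparse and neighborhood-based methods lose discriminating power.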
Keywords/Search Tags: Data, Analyzing, Organizing, Dimensionality, Information, Space