
Algorithms for high dimensional data

Posted on: 2015-08-05
Degree: Ph.D
Type: Thesis
University: Princeton University
Candidate: Nguyen, Huy Le
Full Text: PDF
GTID: 2478390017993121
Subject: Computer Science
Abstract/Summary:
The quest to understand large, high-dimensional data has always been plagued by the so-called "curse of dimensionality": algorithms regularly suffer an exponential dependence on the dimension. One popular approach to this problem is dimension reduction: reducing high-dimensional objects to lower-dimensional objects or other compact representations while preserving the objectives of the problem at hand. This philosophy manifests itself in many different forms, from analytic notions like metric embeddings, to locality sensitive hashing, to information-theoretic notions like sketches.

A fundamental task in data processing is identifying patterns, from similar objects to relations between different attributes of the same object. An abstract way to model similarity is via the notion of distance: two objects are similar if their distance is small, a set of regressors is predictive if the observed dependent variable is at a small distance from the predicted value, and so on. All three of the aforementioned notions preserve distances, but with varying amounts of structure. Metric embeddings preserve the most structure and hence combine easily with existing algorithms. In contrast, sketches require no additional structure and are therefore the easiest to obtain, but they often require new algorithms. Locality sensitive hashing has only limited structure, which is nonetheless enough to enable certain applications such as nearest neighbor search. The flexibility of such a large toolkit is hugely beneficial. We demonstrate applications of these techniques to the design of fast algorithms for practical questions including nearest neighbor search, linear regression, and low rank approximation, study their limitations, and overcome those limitations with new techniques.

Specifically, this thesis presents the following contributions.

Fast dimension reduction for sparse data. In many applications, the input data is high dimensional but also sparse. Examples of such inputs include columns of document-word incidence matrices in information retrieval, person-object scoring matrices in collaborative filtering, and constraint matrices for geodetic survey, photogrammetry, and molecular structure problems [Rice83]. In the last decade, subspace embeddings have emerged as a useful tool for designing algorithms for such data. We present a new construction of subspace embeddings with nearly optimal dimension and embedding time nearly linear in the sparsity of the input. Our new embeddings improve the running time of many important problems, from least squares regression to low rank approximation, which are workhorses of statistical analysis. For instance, our running time for least squares regression is nearly optimal unless one can avoid generic matrix multiplication. (A minimal code sketch of this idea appears after the next contribution.)

Lower bounds for dimension reduction. We show that the embeddings above are in fact nearly optimal in many parameter regimes. The same techniques also yield nearly tight bounds for sparse Johnson-Lindenstrauss embeddings and RIP matrices. We also give new lower bounds for dimension reduction in l_1, showing that for distortion 1 + epsilon there are point sets of size n in l_1 requiring n^(1 - 1/log(1/epsilon)) dimensions. This bound is nearly optimal, as one can reduce the dimension to linear in n while approximately preserving all distances.
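To make the sparse-embedding idea concrete, here is a minimal sketch of a CountSketch-style sparse subspace embedding applied to least squares regression. The embedding matrix has a single random +/-1 entry per column, so applying it costs time proportional to the number of nonzeros of the input; the function name, the row count r, and the synthetic data are illustrative assumptions, not the thesis's exact construction.

```python
import numpy as np
from scipy import sparse

def countsketch_embed(A, b, r, seed=0):
    """Sketch (A, b) with a CountSketch-style embedding S of r rows.

    Each column of S has a single +/-1 entry in a uniformly random row,
    so forming S @ A costs time proportional to nnz(A).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    rows = rng.integers(0, r, size=n)        # random target row for each input row
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign for each input row
    S = sparse.csr_matrix((signs, (rows, np.arange(n))), shape=(r, n))
    return S @ A, S @ b

# Approximate least squares: sketch the tall sparse problem down to r rows,
# then solve the small dense problem exactly.
n, d, r = 100_000, 20, 4_000
A = sparse.random(n, d, density=0.01, format="csr", random_state=1)
b = np.random.default_rng(1).standard_normal(n)
SA, Sb = countsketch_embed(A, b, r)
x_approx, *_ = np.linalg.lstsq(SA.toarray(), Sb, rcond=None)
```

Because the sketch preserves the norms of all vectors in the column span of [A, b] up to small distortion, the solution of the small r-by-d problem is a near-optimal solution of the original n-by-d problem, at a fraction of the cost when n is large.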
Better approximate nearest neighbor search. For approximation factor c, we give a new algorithm for approximate nearest neighbor search in Euclidean space with query time n^rho and space n^(1+rho), where rho = 7/(8c^2) + O(1/c^3). This algorithm bypasses the limits of locality sensitive hashing, previously the main technique for approximate nearest neighbor search in high dimensions. Furthermore, the algorithm achieves its faster running time and smaller space via data-dependent hashing, a method widely used in practice that had long eluded satisfactory mathematical explanation.
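For contrast with the data-dependent approach above, the following is a minimal sketch of classical, data-independent Euclidean LSH of the kind the thesis improves upon: hash a point by quantizing several Gaussian projections, and search only the buckets the query falls into. The class name and the parameters k, L, and w are illustrative assumptions, not the thesis's algorithm.

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """Data-independent LSH for Euclidean distance.

    Each of L hash tables keys a point by k quantized random projections:
    h(x) = floor((a . x + b) / w). Nearby points collide in some table
    with good probability; far points rarely do.
    """
    def __init__(self, dim, k=8, L=10, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((L, k, dim))  # Gaussian projection vectors
        self.b = rng.uniform(0, w, size=(L, k))    # random offsets
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _keys(self, x):
        # One composite key per table: k quantized projections of x.
        return [tuple(np.floor((self.a[t] @ x + self.b[t]) / self.w).astype(int))
                for t in range(len(self.tables))]

    def insert(self, x):
        idx = len(self.points)
        self.points.append(x)
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, q):
        # Gather candidates colliding with q in any table; return the closest.
        cand = {i for table, key in zip(self.tables, self._keys(q))
                for i in table.get(key, ())}
        if not cand:
            return None
        return min(cand, key=lambda i: np.linalg.norm(self.points[i] - q))
```

Data-independent schemes like this fix the hash family before seeing the data; the thesis's data-dependent hashing instead adapts the partitioning to the point set, which is how it beats the rho achievable by any purely data-independent LSH.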
Keywords/Search Tags: Data, Dimension, Algorithms, Nearest neighbor search, Running time, Locality sensitive hashing