
Column subset selection for approximating data matrices

Posted on: 2010-02-26
Degree: Ph.D
Type: Thesis
University: Rensselaer Polytechnic Institute
Candidate: Civril, Ali
Subject: Computer Science
Abstract/Summary:
In this thesis, we study the problem of selecting a subset of columns of a matrix so that they capture the important information contained in the matrix. We present complexity results and algorithms. The problem, in a very broad sense, asks for a "good" subset of columns of a given real matrix that provides a performance guarantee in terms of an objective function related to the spectrum of the matrix.

As motivation, we first present a linear-time spectral graph drawing algorithm, a vast improvement over the standard quadratic-time method, Classical Multidimensional Scaling (CMDS). To guarantee a fast implementation of the algorithm, it is desirable to quickly select a subset of columns of a distance matrix associated with the graph. Intuitively, to obtain a well-conditioned submatrix, one has to choose a subset of column vectors that are, in a geometric sense, as "far away" from each other as possible. We formalize this notion by studying the problem of selecting a subset of k columns satisfying certain orthogonality conditions. We establish the NP-hardness of several such problems and further show that two of them do not admit a PTAS. For the problem of choosing the maximum-volume submatrix, which we call MAX-VOL, we analyze a greedy algorithm and show that it provides a 2^(-O(k log k)) approximation. Our analysis of the greedy heuristic is tight to within a logarithmic factor in the exponent, which we show by explicitly constructing an instance on which the greedy heuristic is a factor 2^(-O(k)) from optimal. Further, we show that no efficient algorithm can appreciably improve upon the greedy algorithm: MAX-VOL is NP-hard to approximate within 2^(-ck) for some constant c. Our proof is via a reduction from the Label-Cover problem.

Our last result is a constructive solution to the low-rank matrix approximation problem, which asks for a subset of columns of a matrix that captures "most" of its spectrum. Our main result is a simple deterministic greedy algorithm with performance guarantees while choosing a small number of columns. Specifically, our greedy algorithm chooses c columns from A, with c = O(k^2 log(k) μ^2(A) / ε^2), such that ‖A − CC⁺A‖_F ≤ (1 + ε)‖A − A_k‖_F, where C is the matrix composed of the c columns, C⁺ is the pseudo-inverse of C (CC⁺A is the best reconstruction of A from C), A_k is the best rank-k approximation of A, and μ(A) is a measure of the coherence in the normalized columns of A. To the best of our knowledge, this is the first deterministic algorithm with performance guarantees on the number of columns and a (1 + ε) approximation ratio in the Frobenius norm. Numerical results suggest that the algorithm performs considerably better in practice than the theoretical bounds indicate.
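The greedy heuristic for MAX-VOL analyzed above can be sketched in a few lines. This is a minimal illustrative implementation, not the thesis's own code: at each step it picks the column with the largest component orthogonal to the span of the columns already chosen, which is exactly the choice that maximizes the volume gained at that step.

```python
import numpy as np

def greedy_max_vol(A, k):
    """Greedily select k column indices of A, approximately maximizing
    the volume of the parallelepiped spanned by the chosen columns.

    Each step picks the column whose residual (the component orthogonal
    to the span of the already-selected columns) has the largest norm,
    then projects that direction out of all remaining columns.
    """
    R = np.array(A, dtype=float, copy=True)  # residuals w.r.t. chosen span
    selected = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        v = R[:, j]
        nv = np.linalg.norm(v)
        if nv == 0:            # all remaining columns lie in the chosen span
            break
        selected.append(j)
        q = v / nv
        R -= np.outer(q, q @ R)  # remove the q-component from every column
    return selected
```

For example, on a diagonal matrix the heuristic simply picks columns in order of decreasing norm, since the columns are already orthogonal.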
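The approximation guarantee above compares ‖A − CC⁺A‖_F against the best rank-k error ‖A − A_k‖_F. As a hedged sketch (assuming NumPy; the function names are ours, not from the thesis), both quantities are easy to compute directly, since CC⁺ is the orthogonal projector onto the span of the chosen columns and ‖A − A_k‖_F follows from the trailing singular values:

```python
import numpy as np

def column_reconstruction_error(A, cols):
    """Frobenius error ||A - C C^+ A||_F of reconstructing A from the
    column subset C = A[:, cols]; C^+ is the Moore-Penrose pseudo-inverse,
    so C C^+ projects orthogonally onto the span of the chosen columns."""
    C = A[:, cols]
    return float(np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro'))

def best_rank_k_error(A, k):
    """Baseline ||A - A_k||_F: the best rank-k approximation error,
    i.e. the root-sum-of-squares of the singular values beyond the k-th."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))
```

Dividing the first quantity by the second gives the observed approximation ratio, which can be checked against the (1 + ε) bound for a given column subset.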
Keywords/Search Tags:Subset, Columns, Algorithm, Matrix, Problem, Performance