
Column subset selection for approximating data matrices

Posted on: 2010-02-26
Degree: Ph.D
Type: Thesis
University: Rensselaer Polytechnic Institute
Candidate: Civril, Ali
Subject: Computer Science
Abstract/Summary:
In this thesis, we study the problem of selecting a subset of columns of a matrix so that they capture the important information contained in the matrix. We present complexity results and algorithms. The problem, in a very broad sense, asks for a "good" subset of columns of a given real matrix that provides a performance guarantee in terms of an objective function related to the spectrum of the matrix.

As motivation, we first present a linear-time spectral graph drawing algorithm, a vast improvement over the standard quadratic-time method, Classical Multidimensional Scaling (CMDS). To guarantee a fast implementation of the algorithm, it is desirable to quickly select a subset of columns of a distance matrix associated with the graph. Intuitively, to obtain a well-conditioned submatrix, one has to choose a subset of column vectors that are, in a geometric sense, as "far away" from each other as possible. We formalize this notion by studying the problem of selecting a subset of k columns satisfying certain orthogonality conditions. We establish the NP-hardness of several such problems and further show that two of them do not admit a PTAS. For the problem of choosing the maximum-volume submatrix, which we call MAX-VOL, we analyze a greedy algorithm and show that it provides a 2^(-O(k log k)) approximation. Our analysis of the greedy heuristic is tight to within a logarithmic factor in the exponent, which we show by explicitly constructing an instance on which the greedy heuristic is a factor 2^(-O(k)) from optimal. Further, we show that no efficient algorithm can appreciably improve upon the greedy algorithm: MAX-VOL is NP-hard to approximate within 2^(-ck) for some constant c. Our proof is via a reduction from the Label-Cover problem.

Our last result is a constructive solution to the low-rank matrix approximation problem, which asks for a subset of columns of a matrix that captures "most" of its spectrum. Our main result is a simple deterministic greedy algorithm with performance guarantees while choosing a small number of columns. Specifically, our greedy algorithm chooses c columns from A, with c = O(k^2 log(k) μ^2(A) / ε^2), such that ‖A − CC⁺A‖_F ≤ (1 + ε)‖A − A_k‖_F, where C is the matrix composed of the c columns, C⁺ is the pseudo-inverse of C (CC⁺A is the best reconstruction of A from C), A_k is the best rank-k approximation of A, and μ(A) is a measure of the coherence in the normalized columns of A. To the best of our knowledge, this is the first deterministic algorithm with performance guarantees on the number of columns and a (1 + ε) approximation ratio in the Frobenius norm. Numerical results suggest that the algorithm performs considerably better in practice than the theoretical bounds indicate.
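The greedy heuristic for MAX-VOL analyzed above can be sketched in a few lines. This is a minimal illustrative implementation, not the thesis's own code: at each step it picks the column with the largest component orthogonal to the span of the columns already chosen, which is exactly the choice that maximizes the volume gained at that step.

```python
import numpy as np

def greedy_max_vol(A, k):
    """Greedily select k column indices of A, approximately maximizing
    the volume of the parallelepiped spanned by the chosen columns.

    Each step picks the column whose residual (the component orthogonal
    to the span of the already-selected columns) has the largest norm,
    then projects that direction out of all remaining columns.
    """
    R = np.array(A, dtype=float, copy=True)  # residuals w.r.t. chosen span
    selected = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        v = R[:, j]
        nv = np.linalg.norm(v)
        if nv == 0:            # all remaining columns lie in the chosen span
            break
        selected.append(j)
        q = v / nv
        R -= np.outer(q, q @ R)  # remove the q-component from every column
    return selected
```

For example, on a diagonal matrix the heuristic simply picks columns in order of decreasing norm, since the columns are already orthogonal.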
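The approximation guarantee above compares ‖A − CC⁺A‖_F against the best rank-k error ‖A − A_k‖_F. As a hedged sketch (assuming NumPy; the function names are ours, not from the thesis), both quantities are easy to compute directly, since CC⁺ is the orthogonal projector onto the span of the chosen columns and ‖A − A_k‖_F follows from the trailing singular values:

```python
import numpy as np

def column_reconstruction_error(A, cols):
    """Frobenius error ||A - C C^+ A||_F of reconstructing A from the
    column subset C = A[:, cols]; C^+ is the Moore-Penrose pseudo-inverse,
    so C C^+ projects orthogonally onto the span of the chosen columns."""
    C = A[:, cols]
    return float(np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro'))

def best_rank_k_error(A, k):
    """Baseline ||A - A_k||_F: the best rank-k approximation error,
    i.e. the root-sum-of-squares of the singular values beyond the k-th."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))
```

Dividing the first quantity by the second gives the observed approximation ratio, which can be checked against the (1 + ε) bound for a given column subset.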
Keywords/Search Tags:Subset, Columns, Algorithm, Matrix, Problem, Performance