
Extensions of Principal Components Analysis

Posted on: 2010-02-27    Degree: Ph.D    Type: Thesis
University: Georgia Institute of Technology    Candidate: Brubaker, S. Charles    Full Text: PDF
GTID: 2448390002976503    Subject: Mathematics
Abstract/Summary:
Principal Components Analysis is a standard tool in data analysis, widely used in data-rich fields such as computer vision, data mining, bioinformatics, and econometrics. For a set of vectors in R^n and a number k < n, the method returns a subspace of dimension k whose average squared distance to that set is as small as possible. Besides saving computation by reducing the dimension, projecting to this subspace can often reveal structure that was hidden in high dimension.

This thesis considers several novel extensions of PCA, which provably reveal hidden structure where standard PCA fails to do so. First, we consider Robust PCA, which prevents a few points, possibly corrupted by an adversary, from having a large effect on the analysis. The key idea is to alternate noise removal with projection to a constant fraction of the dimensions. When applied to noisy mixture models, the algorithm finds a subspace that is close to the pair of means that are farthest apart. By choosing and testing random directions in this subspace, the algorithm finds a partitioning hyperplane that does not cut any component, and then recurses on the two resulting halfspaces. This strategy yields a learning algorithm for noisy mixtures of log-concave distributions that is only slightly weaker than the noiseless result (Chapter 5).

Second, we consider Isotropic PCA, which can go beyond the first two moments in identifying "interesting" directions in data. The algorithm first makes the distribution isotropic through an affine transformation. It then reweights the data and computes the resulting first and second moments. In effect, this simulates a non-isotropic distribution whose moments are sometimes meaningful. For a mixture of Gaussians under a Gaussian reweighting, either the first moment or the direction of maximum second moment can be used to partition the components of the mixture, provided the components are sufficiently separated. This strategy leads to the first affine-invariant algorithm that can provably learn mixtures of Gaussians in high dimensions, improving significantly on known results (Chapter 6).

Third, we define the "Subgraph Parity Tensor" of order r of a graph and reduce the problem of finding planted cliques in random graphs to the problem of finding the top principal component of this tensor (Chapter 7). This extends work by Frieze and Kannan, which considers only third-order tensors. The intuition behind the result is that the entries in the block of the tensor corresponding to the clique all have the same value, while the values in other blocks are uncorrelated. This forces the top principal component of the tensor to "point" toward the clique, and a previously known algorithm can then recover the clique itself.
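The Robust PCA paragraph above only names the alternation between noise removal and projection. The sketch below illustrates that idea in NumPy, assuming a simple distance-based outlier rule and keeping half of the principal directions at each step; the function name, the outlier fraction, and the stopping rule are illustrative assumptions, not the thesis's exact procedure.

```python
# A minimal sketch of the alternating idea: repeatedly discard the points
# farthest from the current mean, then project onto the top half of the
# principal directions.  Parameters here are assumptions, not the thesis's.
import numpy as np

def robust_pca_sketch(X, target_dim, outlier_frac=0.05):
    """X: (m, n) array of samples; returns an (n, target_dim) orthonormal basis."""
    basis = np.eye(X.shape[1])   # current subspace, as columns of an n x d matrix
    Y = X.copy()
    while basis.shape[1] > 2 * target_dim:
        # Noise removal: drop the points farthest from the mean (possible adversarial noise).
        centered = Y - Y.mean(axis=0)
        dists = np.linalg.norm(centered, axis=1)
        Y = Y[dists <= np.quantile(dists, 1.0 - outlier_frac)]
        # Projection: keep (roughly) the top half of the principal directions.
        centered = Y - Y.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        half = max(target_dim, Vt.shape[0] // 2)
        W = Vt[:half].T              # new directions in the current coordinates
        Y = centered @ W
        basis = basis @ W            # compose with the earlier projections
    return basis[:, :target_dim]
```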
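Similarly, the Isotropic PCA steps (make the data isotropic, reweight, read off the first and second moments) can be sketched as follows. The Gaussian reweighting bandwidth, the threshold on the reweighted mean, and the eigenvalue selection rule are illustrative assumptions rather than the choices made in the thesis.

```python
# A minimal sketch of the Isotropic PCA idea: whiten the data with an affine
# map, reweight each point by a Gaussian kernel, and use the reweighted
# first/second moments to pick a direction along which to split the data.
import numpy as np

def isotropic_pca_direction(X, alpha=4.0):
    """X: (m, n) data matrix; returns a unit vector suggesting a split direction."""
    # Step 1: affine transformation to isotropic position (mean 0, covariance I).
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
    Z = (X - mu) @ W

    # Step 2: Gaussian reweighting of the (now isotropic) points.
    weights = np.exp(-np.sum(Z**2, axis=1) / alpha)
    weights /= weights.sum()

    # Step 3: reweighted first and second moments.
    m1 = weights @ Z                       # reweighted mean
    m2 = (Z * weights[:, None]).T @ Z      # reweighted second-moment matrix

    # If the reweighted mean moves noticeably, it already points across the
    # components; otherwise use a direction of extreme reweighted variance.
    if np.linalg.norm(m1) > 0.1:           # threshold is an assumption
        return m1 / np.linalg.norm(m1)
    evals2, evecs2 = np.linalg.eigh(m2)
    idx = np.argmax(np.abs(evals2 - evals2.mean()))
    return evecs2[:, idx]
```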
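Finally, the order-3 case of the subgraph parity tensor can be written down directly: with edges encoded as +1 and non-edges as -1, the entry at (u, v, w) is the product of the three pairwise values, so the block indexed by the clique is identically +1. The sketch below builds this tensor for a small random graph with a planted clique and approximates its top principal component by tensor power iteration; the graph size, clique size, sign convention, and iteration count are assumptions, and recovery at these small sizes is not guaranteed.

```python
# Illustration of the order-3 parity tensor of a graph with a planted clique,
# and a tensor power iteration to approximate its top principal component.
import numpy as np

rng = np.random.default_rng(0)
n, k = 60, 12                                    # graph size and planted clique size (assumed)
A = np.where(rng.random((n, n)) < 0.5, 1.0, -1.0)
A = np.triu(A, 1); A = A + A.T                   # symmetric +/-1 edge indicators, zero diagonal
clique = rng.choice(n, size=k, replace=False)
A[np.ix_(clique, clique)] = 1.0                  # plant the clique: all internal edges present
np.fill_diagonal(A, 0.0)

# Order-3 parity tensor: T[u, v, w] = A[u, v] * A[u, w] * A[v, w].
T = np.einsum('uv,uw,vw->uvw', A, A, A)

# Tensor power iteration: x <- T(., x, x), normalized.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
for _ in range(100):
    x = np.einsum('uvw,v,w->u', T, x, x)
    x /= np.linalg.norm(x)

# The largest coordinates of the top principal component tend to point at the
# clique when it is large enough; at these toy sizes this is not guaranteed.
top = np.argsort(-np.abs(x))[:k]
print(sorted(clique), sorted(top))
```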
Keywords/Search Tags: Principal, Component, Algorithm, Data, PCA, Tensor