Subspace clustering methods for high dimensional data

Posted on: 2009-06-30
Degree: Ph.D
Type: Thesis
University: University of Alberta (Canada)
Candidate: Moise, Gabriela
GTID: 2448390002992959
Subject: Computer Science
Abstract/Summary:
Prominent research has shown that increasing data dimensionality results in the loss of contrast in distances between data points. Thus, clustering algorithms that measure the similarity between data points based on all features/attributes of a data set tend to break down in high dimensional spaces. In addition, not all attributes of a data set may be relevant for the clustering analysis.

Motivated by these observations, it has been hypothesized that data points may form clusters only when a subset of the attributes, i.e., a subspace, is considered. Furthermore, data points may belong to different clusters in different subspaces. Subspace and projected clustering techniques search for clusters of points in subsets of attributes. Subspace clustering enumerates clusters of points in all subsets of attributes, typically producing many overlapping clusters. Projected clustering computes several disjoint clusters, plus outliers, so that each cluster exists in its own subset of attributes.

In this thesis, we propose three novel techniques that advance the state of the art in the subspace and projected clustering field. First, we propose a projected clustering technique, P3C, that (1) depends on parameters that can be set without prior knowledge about the data; (2) can effectively discover low dimensional clusters embedded in high dimensional spaces; (3) can compute disjoint or overlapping clusters. Second, we propose two extensions that make P3C the first projected clustering technique that can be applied to both numerical and categorical data sets. Third, we propose a novel problem formulation for subspace and projected clustering that aims at extracting non-redundant, axis-parallel, statistically significant regions from the data. The problem formulation is given as an optimization problem, for which exhaustive search is not a viable solution because of computational infeasibility. Therefore, we propose an approximation algorithm, STATPC, that has the same advantageous features as P3C but, in addition, guarantees that its solution stands out in the data in a statistical sense and is not just an artefact of the method.
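
The loss of distance contrast that the abstract cites as motivation can be illustrated with a small simulation. The sketch below is not taken from the thesis: the uniform data, the Euclidean distance, the "relative contrast" measure (d_max - d_min) / d_min, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points points drawn uniformly from the
    unit hypercube [0, 1]^n_dims (an illustrative assumption)."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def contrast_by_dimension(dims=(2, 10, 100, 1000), n_points=500, seed=0):
    """Compute the relative contrast for each dimensionality in dims."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return {d: relative_contrast(n_points, d, rng) for d in dims}


if __name__ == "__main__":
    for d, c in contrast_by_dimension().items():
        print(f"dims={d:5d}  relative contrast={c:.3f}")
```

On data like this, the relative contrast typically shrinks as the number of dimensions grows, meaning the nearest and farthest points become almost equally distant from the query. This is the effect that makes full-dimensional similarity measures unreliable and motivates searching for clusters in subsets of attributes.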
Keywords/Search Tags:Data, Clustering, Dimensional, Subspace, Clusters, Attributes