
Scalable clustering algorithms

Posted on: 2006-10-24
Degree: Ph.D.
Type: Thesis
University: The University of Texas at Austin
Candidate: Banerjee, Arindam
Full Text: PDF
GTID: 2458390008973770
Subject: Engineering
Abstract/Summary:
Scalable clustering algorithms that can work with a wide variety of distance measures and also incorporate application-specific requirements are critically important for modern-day data analysis and predictive modeling. In this thesis, we propose and analyze a large class of such algorithms, evaluate their performance on benchmark datasets, and investigate theoretical connections of the proposed algorithms to lossy compression and stochastic prediction.

First, a wide variety of popular centroid-based clustering algorithms are unified using a large class of distance measures known as Bregman divergences. We present both hard and soft clustering algorithms using Bregman divergences. By establishing a bijection between regular exponential family distributions and regular Bregman divergences, we note that Bregman soft clustering algorithms are equivalent to learning mixtures of exponential family distributions, but can be computationally more efficient in practice. We also design algorithms for clustering directional data that generate balanced clusters, i.e., clusters of comparable sizes, a desirable property in certain practical applications. Experimental results show that such algorithms perform well for high-dimensional problems such as text clustering.

A general framework for scaling up balanced clustering algorithms is then proposed. The framework is applicable to all the algorithms presented in this thesis as well as a wide variety of other algorithms. Extensive experimental results on benchmark datasets are provided to establish the efficacy of the proposed framework. Further, we propose a new method for evaluation and model selection for clustering that can be applied to practically any clustering algorithm. The method is applicable in a transductive setting and measures the predictive accuracy of a clustering algorithm.

A detailed analysis of the connections of rate distortion theory to the proposed clustering algorithms, in particular the Bregman clustering algorithms, is also presented. In the process, we establish some key theoretical results in rate distortion theory for Bregman divergences, special cases of which have been studied in the literature using squared Euclidean distance. Also, we generalize a widely known result in stochastic prediction by establishing that the conditional expectation is the optimal predictor of a random variable if and only if the prediction error is measured by a Bregman divergence. This result explains the fundamental reason behind the efficiency of the Bregman clustering algorithms.
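
As an illustration of the hard clustering idea summarized above, the following is a minimal sketch and not the thesis's own implementation. It assumes the standard definition of a Bregman divergence, d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>, and shows a k-means-style loop in which points are assigned under an arbitrary Bregman divergence while each cluster representative is always the arithmetic mean of its members. The function names, the explicit `init_centroids` argument, and the two example divergences are choices made for this sketch only.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Divergence = Callable[[Vector, Vector], float]


def squared_euclidean(x: Vector, y: Vector) -> float:
    """Bregman divergence generated by phi(x) = ||x||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def generalized_kl(x: Vector, y: Vector) -> float:
    """Bregman divergence generated by phi(x) = sum_i x_i log x_i.

    Assumes all coordinates of x and y are strictly positive.
    """
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))


def centroid(points: Sequence[Vector]) -> List[float]:
    """Arithmetic mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def total_divergence(points: Sequence[Vector], rep: Vector,
                     divergence: Divergence) -> float:
    """Sum of divergences from each point to a single representative."""
    return sum(divergence(x, rep) for x in points)


def bregman_hard_clustering(
    points: Sequence[Vector],
    init_centroids: Sequence[Vector],
    divergence: Divergence,
    max_iters: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Alternate assignment and mean-update steps until labels stop changing."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels: List[int] = []
    for _ in range(max_iters):
        # Assignment step: nearest representative under the chosen divergence.
        new_labels = [
            min(range(k), key=lambda j: divergence(x, centroids[j]))
            for x in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the mean is used whatever the divergence is.
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous representative
                centroids[j] = centroid(members)
    return labels, centroids


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
            (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
    start = [(1.0, 1.0), (9.0, 9.0)]
    for div in (squared_euclidean, generalized_kl):
        print(div.__name__, bregman_hard_clustering(data, start, div))
```

Only the assignment step depends on the divergence; the update step is the same mean computation in every case, which mirrors the abstract's remark that the conditional expectation is the optimal predictor exactly when the error is a Bregman divergence. Swapping `squared_euclidean` for `generalized_kl` changes the geometry of the clusters without changing the structure of the loop.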
Keywords/Search Tags:Clustering algorithms, Bregman, Wide variety, Results