Scalable clustering algorithms

Posted on: 2006-10-24

Degree: Ph.D.

Type: Thesis

University: The University of Texas at Austin

Candidate: Banerjee, Arindam

GTID: 2458390008973770

Subject: Engineering

Abstract/Summary:

Scalable clustering algorithms that can work with a wide variety of distance measures and also incorporate application-specific requirements are critically important for modern-day data analysis and predictive modeling. In this thesis, we propose and analyze a large class of such algorithms, evaluate their performance on benchmark datasets, and investigate theoretical connections of the proposed algorithms to lossy compression and stochastic prediction.

First, a wide variety of popular centroid-based clustering algorithms are unified using a large class of distance measures known as Bregman divergences. We present both hard and soft clustering algorithms using Bregman divergences. By establishing a bijection between regular exponential family distributions and regular Bregman divergences, we note that Bregman soft clustering algorithms are equivalent to learning mixtures of exponential family distributions, but can be computationally more efficient in practice. We also design algorithms for clustering directional data that generate balanced clusters, i.e., clusters of comparable sizes, a desirable property in certain practical applications. Experimental results show that such algorithms perform well for high-dimensional problems such as text clustering.

A general framework for scaling up balanced clustering algorithms is then proposed. The framework is applicable to all the algorithms presented in this thesis as well as a wide variety of other algorithms. Extensive experimental results on benchmark datasets are provided to establish the efficacy of the proposed framework. Further, we propose a new method for evaluation and model selection for clustering that can be applied to practically any clustering algorithm. The method is applicable in a transductive setting and measures the predictive accuracy of a clustering algorithm.

A detailed analysis of the connections of rate distortion theory to the proposed clustering algorithms, in particular the Bregman clustering algorithms, is also presented. In the process, we establish some key theoretical results in rate distortion theory for Bregman divergences, special cases of which have been studied in the literature using squared Euclidean distance. Also, we generalize a widely known result in stochastic prediction by establishing that the conditional expectation is the optimal predictor of a random variable if and only if the prediction error is measured by a Bregman divergence. This result explains the fundamental reason behind the efficiency of the Bregman clustering algorithms.
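
As a rough illustration of the hard-clustering idea described in the abstract, the following minimal sketch runs a k-means-style loop with an interchangeable Bregman divergence. It is not the thesis's implementation: the function names (`squared_euclidean`, `generalized_i_divergence`, `assign`, `bregman_hard_cluster`) and the choice to pass initial centroids explicitly are assumptions made for this example. It relies on the property stated in the abstract that the mean is the optimal representative under any Bregman divergence, so the centroid update is always the arithmetic mean and only the assignment step depends on the divergence.

```python
import numpy as np


def squared_euclidean(x, mu):
    """Bregman divergence generated by phi(x) = ||x||^2."""
    return np.sum((x - mu) ** 2, axis=-1)


def generalized_i_divergence(x, mu):
    """Bregman divergence generated by phi(x) = sum_j x_j log x_j.

    Assumes all entries of x and mu are strictly positive.
    """
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)


def assign(X, centroids, divergence):
    """Label each row of X with the index of its closest centroid."""
    dists = np.stack([divergence(X, c) for c in centroids], axis=1)
    return dists.argmin(axis=1)


def bregman_hard_cluster(X, init_centroids, divergence, n_iter=50):
    """Illustrative hard clustering loop (hypothetical helper).

    Alternates between assigning points to the closest centroid under
    `divergence` and resetting each centroid to the mean of its points.
    """
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centroids, divergence)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(X, centroids, divergence), centroids


if __name__ == "__main__":
    # Toy positive-valued data: two well-separated groups (made up for this sketch).
    X = np.array([[1.0, 1.0], [1.1, 0.9], [10.0, 10.0], [10.2, 9.8]])
    for div in (squared_euclidean, generalized_i_divergence):
        labels, centroids = bregman_hard_cluster(X, X[[0, 2]], div)
        print(div.__name__, labels, centroids)
```

Swapping the `divergence` argument changes the notion of closeness without touching the update step, which is one way to read the abstract's claim that many centroid-based algorithms fit a single framework. Like k-means, this loop is sensitive to initialization and may stop in a local optimum.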

Keywords/Search Tags:

Clustering algorithms, Bregman, Wide variety, Results

Related items

1	Clustering Web documents: A phrase-based method for grouping search engine results
2	Research Of XML Information Retrieval Based On Pseudo-relevance Feedback
3	Research On Semantics-Based Search Results Clustering Methods
4	On The World Wide Web Search Engine Returns The Results Of Fuzzy Clustering Study
5	Research On The XML Pseudo Relevance Feedback Technology Based On Clustering Search Results
6	L₁-norm Minimization Method Based On Bregman Iteration And Its Applications
7	Research Of Query Expansion And Search Results Clustering For Web Information Retrieval
8	A Study On The Evaluation And Improvement Of Text Clustering Results
9	Chinese Search Results Clustering Research Based On Improved STC
10	Research And Implementation Of Clustering Systems Of Web Search Results