Font Size: a A A

Analyzing and improving local search: K-means and ICP

Posted on:2010-03-10Degree:Ph.DType:Dissertation
University:Stanford UniversityCandidate:Arthur, DavidFull Text:PDF
GTID:1448390002471855Subject:Computer Science
Abstract/Summary:
Lloyd's method (also known as the k-means method) and Iterative Closest Point are two prominent algorithms that use local search to sacrifice accuracy guarantees for speed and simplicity. On the large-scale data sets that arise in practice, this trade-off is appealing, and both algorithms are very widely used.;Previously however, little was known about the worst-case running time of either algorithm, and there was no satisfactory explanation as to why they are fast in practice. In this work, we begin by showing that worst-case complexity cannot offer such an explanation because surprisingly, both algorithms have super-polynomial worst-case running time. To reconcile this shortcoming with the algorithms' observed speed, we turn to the model of smoothed analysis, where running time is measured only after adding a small amount of random noise to the input. We show that in this model, both Lloyd's method and Iterative Closest Point have polynomial complexity. This implies any worst-case scenario requires a pathological placement of points, which is extremely unlikely in practice.;We also observe that Lloyd's method, for all of its advantages, can give arbitrarily bad output even on real-world data sets. We introduce k-means++, which tweaks the initialization procedure for Lloyd's method to get an algorithm with provable accuracy guarantees. Experiments show k-means++ is both faster and more accurate than Lloyd's method in practice, often by considerable margins. We believe this is the first k-means clustering algorithm that combines theoretical guarantees with positive experimental results.
Keywords/Search Tags:K-means, Lloyd's method, Algorithm
Related items