Font Size: a A A

Algorithms for similarity search and clustering in large data sets

Posted on:2004-09-28Degree:Ph.DType:Thesis
University:Stanford UniversityCandidate:Gionis, AristidesFull Text:PDF
GTID:2468390011462288Subject:Computer Science
Abstract/Summary:
In this thesis we present algorithms for performing similarity search and clustering in large datasets. Similarity search is necessary in order to locate relevant information in a database, and it can be used for prediction and classification. Clustering, a form of data summarization, can facilitate visualization, understanding, and knowledge extraction.; The core of our algorithms for similarity search rely on an hashing-based algorithmic technique. The idea is to create signatures for data objects so that similar objects are likely to have identical signatures. Then, similar objects in the database can be located by indexing their signatures.; We first study the problem of nearest-neighbor search in high-dimensional data. We present an efficient algorithm for the approximate version of the problem. The algorithm is designed for external memory and thus is appropriate for very large datasets. Our experimental evaluation shows that our method improves significantly over previous approaches, and that it scales well with the number of dimensions.; Then we present a study of indexing techniques that facilitate similarity queries for sets. We first show how to reduce the set-indexing problem to a related problem of indexing binary vectors, and then we introduce data structures to solve the latter problem. We perform experiments on real datasets to test the accuracy and efficiency of our algorithms and to study the trade-offs of our data structures.; We apply our set-indexing algorithm to the problem of similarity search on the Web. For the latter problem, we also we develop a methodology in order to evaluate different strategies for Web-page representation. Our experiments provide useful intuition for understanding the strategies we evaluate, and our system gives good-quality answers within small response time.; Finally we study the problem of temporal clustering. The goal is to discover clusters of similar objects so that the temporal coherence of data is respected. We show that the problem is NP-hard for dimensions greater than one and we give constant-factor-approximation algorithms. We implement our algorithms, and we evaluate their performance on synthetic and real datasets. Our experiments suggest that the proposed algorithms achieve high-quality solutions in practice.
Keywords/Search Tags:Algorithms, Data, Similarity search, Clustering, Large, Problem
Related items