Algorithms for similarity search and clustering in large data sets

Posted on:2004-09-28

Degree:Ph.D

Type:Thesis

University:Stanford University

Candidate:Gionis, Aristides

Full Text:PDF

GTID:2468390011462288

Subject:Computer Science

Abstract/Summary:

In this thesis we present algorithms for performing similarity search and clustering in large datasets. Similarity search is necessary in order to locate relevant information in a database, and it can be used for prediction and classification. Clustering, a form of data summarization, can facilitate visualization, understanding, and knowledge extraction.; The core of our algorithms for similarity search rely on an hashing-based algorithmic technique. The idea is to create signatures for data objects so that similar objects are likely to have identical signatures. Then, similar objects in the database can be located by indexing their signatures.; We first study the problem of nearest-neighbor search in high-dimensional data. We present an efficient algorithm for the approximate version of the problem. The algorithm is designed for external memory and thus is appropriate for very large datasets. Our experimental evaluation shows that our method improves significantly over previous approaches, and that it scales well with the number of dimensions.; Then we present a study of indexing techniques that facilitate similarity queries for sets. We first show how to reduce the set-indexing problem to a related problem of indexing binary vectors, and then we introduce data structures to solve the latter problem. We perform experiments on real datasets to test the accuracy and efficiency of our algorithms and to study the trade-offs of our data structures.; We apply our set-indexing algorithm to the problem of similarity search on the Web. For the latter problem, we also we develop a methodology in order to evaluate different strategies for Web-page representation. Our experiments provide useful intuition for understanding the strategies we evaluate, and our system gives good-quality answers within small response time.; Finally we study the problem of temporal clustering. The goal is to discover clusters of similar objects so that the temporal coherence of data is respected. We show that the problem is NP-hard for dimensions greater than one and we give constant-factor-approximation algorithms. We implement our algorithms, and we evaluate their performance on synthetic and real datasets. Our experiments suggest that the proposed algorithms achieve high-quality solutions in practice.

Keywords/Search Tags:

Algorithms, Data, Similarity search, Clustering, Large, Problem

Related items

1	High-Dimensional Similarity Search for Large Datasets
2	Research On Key Technologies Of High Performance Similarity Search Algorithms And Optimization
3	Similarity Search On Large-scale High-dimensional Data
4	Hashing For Large Scale Similarity Search
5	A Classification Method For PU Problem Based On Data Distribution And Text Similarity
6	Study On Clustering For Large Data Sets And Its Applications
7	Research On Clustering Algorithms For Large-scale Complex Data
8	Research On Key Problems About Large-Scale Text Clustering
9	Similarity Search in Visual Data
10	Study On Outlier Mining Algorithms Based On Clustering