Font Size: a A A

Fast, incremental, and scalable all pairs similarity search

Posted on:2011-10-07Degree:Ph.DType:Dissertation
University:North Carolina State UniversityCandidate:Awekar, AmitFull Text:PDF
GTID:1448390002464682Subject:Artificial Intelligence
Abstract/Summary:
Searching pairs of similar data records is an operation required for many data mining techniques like clustering and collaborative filtering. With the emergence of the Web, scale of the data has increased to several millions or billions of records. Business and scientific applications like search engines, digital libraries, and systems biology often deal with massive datasets in a high dimensional space. The overarching goal of this dissertation is to enable fast and incremental similarity search over large high dimensional datasets through improved indexing, systematic heuristic optimizations, and scalable parallelization.;In Task 1, we design a sequential algorithm for All Pairs Similarity Search (APSS) that involves finding all pairs of records having similarity above a specified threshold. Our proposed fast matching technique speeds-up APSS computation by using novel tighter bounds for similarity computation and indexing data structure. It offers the fastest solution known to date with up to 6X speed-up over the state-of-the-art existing APSS algorithm.;In Task 2, we address the incremental formulation of the APSS problem, where APSS is performed multiple times over a given dataset while varying the similarity threshold. Our goal is to avoid redundant computations across multiple invocations of APSS by storing computation history during each APSS. Depending on the similarity threshold variation, our proposed history binning and index splitting techniques achieve speed-ups from 2X to over 105X over the state-of-the-art APSS algorithm. To the best of our knowledge, this is the first work that addresses this problem.;In Task 3, we design scalable parallel algorithms for APSS that take advantage of modern multi-processor, multi-core architectures to further scale-up the APSS computation. Our proposed index sharing technique divides the APSS computation into independent tasks and achieves ideal strong scaling behavior on shared memory architectures. We also propose a complementary incremental index sharing technique, which provides a memory-efficient parallel APSS solution while maintaining almost linear speed-up. Performance of our parallel APSS algorithms remains consistent for datasets of various sizes. To the best of our knowledge, this is the first work that explores parallelization for APSS.;We demonstrate the effectiveness of our techniques using four record datasets.
Keywords/Search Tags:APSS, Pairs, Similarity, Search, Data, Technique, Incremental, Scalable
Related items