Fast, incremental, and scalable all pairs similarity search

Posted on:2011-10-07

Degree:Ph.D

Type:Dissertation

University:North Carolina State University

Candidate:Awekar, Amit

Full Text:PDF

GTID:1448390002464682

Subject:Artificial Intelligence

Abstract/Summary:

Searching pairs of similar data records is an operation required for many data mining techniques like clustering and collaborative filtering. With the emergence of the Web, scale of the data has increased to several millions or billions of records. Business and scientific applications like search engines, digital libraries, and systems biology often deal with massive datasets in a high dimensional space. The overarching goal of this dissertation is to enable fast and incremental similarity search over large high dimensional datasets through improved indexing, systematic heuristic optimizations, and scalable parallelization.;In Task 1, we design a sequential algorithm for All Pairs Similarity Search (APSS) that involves finding all pairs of records having similarity above a specified threshold. Our proposed fast matching technique speeds-up APSS computation by using novel tighter bounds for similarity computation and indexing data structure. It offers the fastest solution known to date with up to 6X speed-up over the state-of-the-art existing APSS algorithm.;In Task 2, we address the incremental formulation of the APSS problem, where APSS is performed multiple times over a given dataset while varying the similarity threshold. Our goal is to avoid redundant computations across multiple invocations of APSS by storing computation history during each APSS. Depending on the similarity threshold variation, our proposed history binning and index splitting techniques achieve speed-ups from 2X to over 105X over the state-of-the-art APSS algorithm. To the best of our knowledge, this is the first work that addresses this problem.;In Task 3, we design scalable parallel algorithms for APSS that take advantage of modern multi-processor, multi-core architectures to further scale-up the APSS computation. Our proposed index sharing technique divides the APSS computation into independent tasks and achieves ideal strong scaling behavior on shared memory architectures. We also propose a complementary incremental index sharing technique, which provides a memory-efficient parallel APSS solution while maintaining almost linear speed-up. Performance of our parallel APSS algorithms remains consistent for datasets of various sizes. To the best of our knowledge, this is the first work that explores parallelization for APSS.;We demonstrate the effectiveness of our techniques using four record datasets.

Keywords/Search Tags:

Related items

1	Multi-input multi-output APSS for rotor angle stability
2	Research On Similarity Search Of High Dimensional Data Based On Hash Technique
3	Research Of Parallel All Pairs Similarity Search On CUDA
4	Study On Two Complementary Sequence Pairs And The Best Four Element Array Pairs
5	The Scalable and Accountable Binary Code Search And its Application
6	Study On Similarity Analysis Of Event Streams In Real-time Monitored Applications
7	Efficient similarity search in high-dimensional data spaces
8	Design And Implementation Of Search Engine System Based On The Incremental Crawler
9	Research On Key Technique Of Reachability Queries Processing Within K-steps On Graph Data
10	Study On Similarity Search For Textual And Spatial Data