Font Size: a A A

Efficient Processing of Set-Similarity Joins on Large Clusters

Posted on:2012-12-17Degree:Ph.DType:Thesis
University:University of California, IrvineCandidate:Vernica, RaresFull Text:PDF
GTID:2468390011466429Subject:Computer Science
Abstract/Summary:
Joining two datasets where the join condition is based on set similarity instead of equality is very important in many applications, such as social networks, master data management, plagiarism detection, and query refinement based on users' similar preferences. For instance, finding Web site members with similar interests and removing duplicate customers after company acquisitions are two examples. Additionally, many applications need to deal with a large amount of data that cannot be processed by a single machine. Using clusters of machines where algorithms are executed in parallel is a more scalable approach. Recently, scalable, shared-nothing, data-intensive computation platforms have gained attention from both industry and academia. This trend started with the MapReduce framework, and more general frameworks have been proposed recently.;In this thesis we study how to efficiently perform set-similarity joins, a.k.a. fuzzy-joins, on large shared-nothing clusters. We propose a three-stage approach for computing end-to-end set-similarity joins. The input to the approach is a set of records (or two sets of records), and the output is a set of joined records based on the desired set-similarity condition. The proposed approach efficiently partitions fuzzy-join processing across many nodes, balancing the workload and minimizing the need for data replication.;This thesis makes three contributions related to the proposed three-stage approach to compute fuzzy-joins. First, we study fuzzy-joins in the context of the popular MapReduce framework. Next, we propose a set of improvements to the MapReduce runtime model to improve its handling of fuzzy-join queries as well as other types of queries. Finally, we study fuzzy-joins in the context of a new data-intensive computing platform being developed at UC Irvine, ASTERIX. We show how to integrate fuzzy-joins into the system starting at the query language level and ending in efficient execution. Along with each of the contributions, we provide performance results from experiments on real datasets, synthetically increased in size, to study the speedup and scaleup properties of our techniques.
Keywords/Search Tags:Set-similarity joins, Data, Large
Related items