Efficient Processing of Set-Similarity Joins on Large Clusters

Posted on:2012-12-17

Degree:Ph.D

Type:Thesis

University:University of California, Irvine

Candidate:Vernica, Rares

Full Text:PDF

GTID:2468390011466429

Subject:Computer Science

Abstract/Summary:

Joining two datasets where the join condition is based on set similarity instead of equality is very important in many applications, such as social networks, master data management, plagiarism detection, and query refinement based on users' similar preferences. For instance, finding Web site members with similar interests and removing duplicate customers after company acquisitions are two examples. Additionally, many applications need to deal with a large amount of data that cannot be processed by a single machine. Using clusters of machines where algorithms are executed in parallel is a more scalable approach. Recently, scalable, shared-nothing, data-intensive computation platforms have gained attention from both industry and academia. This trend started with the MapReduce framework, and more general frameworks have been proposed recently.;In this thesis we study how to efficiently perform set-similarity joins, a.k.a. fuzzy-joins, on large shared-nothing clusters. We propose a three-stage approach for computing end-to-end set-similarity joins. The input to the approach is a set of records (or two sets of records), and the output is a set of joined records based on the desired set-similarity condition. The proposed approach efficiently partitions fuzzy-join processing across many nodes, balancing the workload and minimizing the need for data replication.;This thesis makes three contributions related to the proposed three-stage approach to compute fuzzy-joins. First, we study fuzzy-joins in the context of the popular MapReduce framework. Next, we propose a set of improvements to the MapReduce runtime model to improve its handling of fuzzy-join queries as well as other types of queries. Finally, we study fuzzy-joins in the context of a new data-intensive computing platform being developed at UC Irvine, ASTERIX. We show how to integrate fuzzy-joins into the system starting at the query language level and ending in efficient execution. Along with each of the contributions, we provide performance results from experiments on real datasets, synthetically increased in size, to study the speedup and scaleup properties of our techniques.

Keywords/Search Tags:

Set-similarity joins, Data, Large

Related items

1	Research On Privacy-preserving Spatio-textual Similarity Joins
2	Research On Similarity Join Processing Based On Entity
3	Research On Entity Resolution Framework And Key Techniques For Big Data Integration
4	ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduc
5	Learning Over Joins
6	An Improved String Similarity Join Algorithm
7	A Partition-based Bi-Directional Filtering-Verification Method For String Similarity Joins
8	An Improved Cost Model For Spatial Joins
9	Key Technology Research Of XML Query Pattern Matching Based On Structural Joins
10	Research On The Key Technologies Of Large-Scale Real-Time Data Streams Joins