ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduc

Posted on:2014-11-01

Degree:M.E

Type:Thesis

University:The University of Toledo

Candidate:Lakshminarayanan, Mahalakshmi

Full Text:PDF

GTID:2458390005497585

Subject:Computer Science

Abstract/Summary:

Similarity Join is an important operation for data mining, with a diverse range of real world applications. Three efficient MapReduce algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence, they are vital for improving the efficiency of the algorithm. Multisets represent real world data better, by considering the frequency of its elements. Prior serial algorithms incorporate filtering techniques only for sets, but not multisets, while prior MapReduce algorithms do not incorporate any filtering technique or inefficiently incorporate prefix filtering with poor scalability.;This work extends the filtering techniques, namely the prefix, size, positional and suffix filters to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Adeptly incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency. In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce Framework. The pairs that thrive filtering are joined suavely in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage. We also developed a rational and creative technique to enhance the scalability of the algorithm as a contingency need.;In the ESSJ algorithm, all the filtering techniques, namely, prefix, size, positional as well as suffix filtering are incorporated in the MapReduce Framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency to a file.;In the EASE algorithm, all the filtering techniques, namely, prefix, size, positional and suffix are incorporated in the MapReduce Framework. However it is tailored as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined utilizing the Multiset File similar to SSS, and some multisets are joined without utilizing it similar to ESSJ. The algorithm harvests the benefits of both the strategies.;SSS and ESSJ algorithms were developed using Hadoop and tested using real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate phenomenal performance gains of over 70% in comparison to the competing state-of-the-art algorithm.

Keywords/Search Tags:

Related items

1	Research On Collaborative Filtering Recommended Algorithm And Implementation Of MapReduce
2	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
3	Research Of Scalable Collaborative Filtering Algorithm Based On Mapreduce
4	Research On Techniques Of Similarity-based Distributed Duplication Elimination
5	Design And Implementation Of Similarity Self - Connection Algorithm For Massive Data Sets Based On MapReduce
6	Research On Clustering Collaborative Filtering Recommendation Algorithm Based On MapReduce
7	Research On Entity Resolution Framework And Key Techniques For Big Data Integration
8	A Hybrid Recommendation Algorithm Based On MapReduce
9	The Research Of KNN Classification Algorithm For Mass Text Based On MapReduce
10	Research And Application Of Collaborative Filtering Recommendation Algorithm Based On User-similarity In Social Networks