Research On Improvement Of Similarity Join In MapReduce

Posted on:2018-05-03

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Liu

Full Text:PDF

GTID:2348330512477210

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

In the era of big data,spell check,data cleaning,collaborative filtering etc have been research hot spots.Similarity join as one of the basic operations has potential value among all data analyses,meanwhile the MapReduce programming model proposed by Google has been one of the most popular parallel computing models to deal with massive data these years,however it can not provide better support for similarity joins and thus similarity join in MapReduce becomes an important and extensible research area of big data.As a classical algorithm,n-gram algorithm has been widely applied in similarity join,it adds typos factor to similarity measurement,however,due to many redundant tokens generated by n-gram algorithm,it may take extra memory and extra time to handle these redundant tokens.So some researchers improved n-gram algorithm by reducing prefix tokens called ED-Join algorithm,but this algorithm adds position factor to the token,so although its running time decreases a lot,its memory usage decreases a little.There is another shortage,these two algorithm do not implement in parallel architecture,so they can not meet massive data requirement,thus reducing redundant tokens,memory usage and time usage become serious problems to be solved especially in the era of big data.To solve these problems,the paper improves traditional n-gram algorithm and ED-Join algorithm,and proposes an improved algorithm that apply to MapReduce architecture called n-gram-imp algorithm.First of all,when partition string with n-gram-imp algorithm,we use a non-redundant sliding window instead of a redundant one,so it requires a small amount of prefix tokens in the filtering phase,and the spatial complexity is reduced compared with old ones when n is in a certain range.Secondly,n-gram-imp algorithm can generate tokens without redundancy,so we only need to screen out specific prefix tokens to do filtering called Pre-imp filtering algorithm.Finally,we implement the algorithm in MapReduce computing model,it can not only reduce running time,but also improve the efficiency of processing data.A large amount of experiments were carried out on real data set and synthetic data sets.Through the comparisons and analyses of all experimental results,we know that the proposed algorithm can perform better in generating tokens,running time and memory consumption compared with the traditional n-gram algorithm and ED-Join algorithm.

Keywords/Search Tags:

PDF Full Text Request

Related items

1	Efficient Similarity Join Over Probabilistic Data Streams Based On Earth Mover's Distance
2	The Research Of Similarity Join Based On MapReduce
3	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
4	Research Of Interval Join Method Base On MapReduce
5	Aggregate Queries On Constrained Probabilistic Similarity Join Pairs
6	Design And Optimize Big-Data Join Algorithms Using MapReduce
7	Research On String Similarity Join Method Based On Hadoop Platform
8	Design And Implementation Of Data Integration System Based-on Similarity Join
9	Research And Optimization Of Join Algorithm Based On MapReduce
10	Join Method Research Based On MapReduce