Font Size: a A A

Research On Improvement Of Similarity Join In MapReduce

Posted on:2018-05-03Degree:MasterType:Thesis
Country:ChinaCandidate:Y X LiuFull Text:PDF
GTID:2348330512477210Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In the era of big data,spell check,data cleaning,collaborative filtering etc have been research hot spots.Similarity join as one of the basic operations has potential value among all data analyses,meanwhile the MapReduce programming model proposed by Google has been one of the most popular parallel computing models to deal with massive data these years,however it can not provide better support for similarity joins and thus similarity join in MapReduce becomes an important and extensible research area of big data.As a classical algorithm,n-gram algorithm has been widely applied in similarity join,it adds typos factor to similarity measurement,however,due to many redundant tokens generated by n-gram algorithm,it may take extra memory and extra time to handle these redundant tokens.So some researchers improved n-gram algorithm by reducing prefix tokens called ED-Join algorithm,but this algorithm adds position factor to the token,so although its running time decreases a lot,its memory usage decreases a little.There is another shortage,these two algorithm do not implement in parallel architecture,so they can not meet massive data requirement,thus reducing redundant tokens,memory usage and time usage become serious problems to be solved especially in the era of big data.To solve these problems,the paper improves traditional n-gram algorithm and ED-Join algorithm,and proposes an improved algorithm that apply to MapReduce architecture called n-gram-imp algorithm.First of all,when partition string with n-gram-imp algorithm,we use a non-redundant sliding window instead of a redundant one,so it requires a small amount of prefix tokens in the filtering phase,and the spatial complexity is reduced compared with old ones when n is in a certain range.Secondly,n-gram-imp algorithm can generate tokens without redundancy,so we only need to screen out specific prefix tokens to do filtering called Pre-imp filtering algorithm.Finally,we implement the algorithm in MapReduce computing model,it can not only reduce running time,but also improve the efficiency of processing data.A large amount of experiments were carried out on real data set and synthetic data sets.Through the comparisons and analyses of all experimental results,we know that the proposed algorithm can perform better in generating tokens,running time and memory consumption compared with the traditional n-gram algorithm and ED-Join algorithm.
Keywords/Search Tags:similarity join, MapReduce programming model, n-gram, ED-Join, sliding window
PDF Full Text Request
Related items