Font Size: a A A

Research And Implementation Of High Efficiency Set Similarity Join Algorithm Based On Overlap Similarity

Posted on:2021-05-27Degree:MasterType:Thesis
Country:ChinaCandidate:P X HuaFull Text:PDF
GTID:2518306104488324Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the development of information technology and the arrival of the era of big data,more and more heterogeneous data are produced and stored.In order to use these highly variety data simply and efficiently,it is essential to integrate data from heterogeneous systems.And the set similarity join is core functionality in data integration.Set similarity join aims to find of all pairs whose similarities are greater than the given similarity threshold or the most similar top k pairs under the given similarity function(such as Overlap similarity,Jaccard similarity,Cosine similarity,Dice similarity or Edit distance etc.).Based on the in-depth observation and analysis of the classic algorithm Topk-Join,it is found that Topk-Join only accesses one element in each iteration,and if it accesses multiple elements each time,it can effectively improve the performance,the number of elements accessed each time is called step size.Because different step size has different influence on the performance of Topk-Join,the Fixed-Step-Size algorithm obtains better performance than Topk-Join by properly increasing the step size.Because the characteristics of different data collections are also different,a fixed step size can not be applied to all data collections.According to the characteristics of data collections,the Variable-Step-Size algorithm automatically changes its step size in each iteration,so that it can be applied to all data collections,and obtains better performance than Topk-Join.Through a large number of experiments on real large-scale datasets,it is proved that the Variable-Step-Size algorithm and the Variable-Step-Size algorithm are much better than the existing algorithm in running time,and the size of inverted index and candidate set has not changed much,but the proportion of verification process in the total time is larger.It shows that the Variable-Step-Size algorithm and the Variable-Step-Size algorithm can greatly accelerate the generation of candidate sets.
Keywords/Search Tags:Data Integration, Overlap Similarity, Set Similarity Join, Topk-Join, FixedStep-Size algorithm, Variable-Step-Size algorithm
PDF Full Text Request
Related items