Font Size: a A A

Optimizing Top-k String Similarity Join Algorithm

Posted on:2017-03-01Degree:MasterType:Thesis
Country:ChinaCandidate:L H YangFull Text:PDF
GTID:2308330503453763Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Similarity join is widely used in data cleaning, data integration and the detection of near duplicate webpages. There are many methods to measure the similarity, including Jaccard, Cosine, Dice and Hamming. In this paper, we mainly focus on the research of the similarity join between the Jaccard and Cosine.Existing similarity join algorithms fall into two categories: the threshold-based similarity join and the Top-k similarity join. The threshold-based similarity join is that it gives a threshold θ and returns pairs of string whose threshold is not less than θ; The Top-k similarity join is that its threshold is unknown and we need to find the most k’s similar pairs of string form the collection.Top-k similarity join is suitable for applications in which the threshold is unknown in advance. The most efficient Top-k similarity join algorithm is Topk-join, which is proposed by Xiao et al. But the problem of the algorithm is that: In order to reduce the redundant computation, every candidate record generated in the process of the algorithm needs to find hash table, which leads to more hash lookup cost; Each prefix event only deals with a Token of the record, making the temporary result set close to the true result is slower, and the maintenance cost of the prefix event and the temporary result heap is too large.In order to resolve the performance problems of Topk-join, a novel Top-k similarity join algorithm Opt-join is proposed in this paper. By integrating the token batch processing technique into the existing event-driven framework, Opt-join reduces the cost of processing the prefix events. By improving the performance of the maxdepth depth optimization algorithm and given the qualitative analysis of the selection of maxdepth. Also, Opt-join reduces the cost in hash lookup by switching the positions of the the hash lookup and filtering operations. The correctness of the new algorithm is proved.Experimental results show that 1.28x-3.09 x speed-up is achieved by Opt-join compared with Topk-join. More importantly, with the increase of the record length or the k value, Opt-join surpasses Topk-join by a larger margin.
Keywords/Search Tags:Top-k similarity join, event-driven framework, token batch processing, hash lookup optimization
PDF Full Text Request
Related items