Optimizing Top-k String Similarity Join Algorithm

Posted on:2017-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:L H Yang

Full Text:PDF

GTID:2308330503453763

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Similarity join is widely used in data cleaning, data integration and the detection of near duplicate webpages. There are many methods to measure the similarity, including Jaccard, Cosine, Dice and Hamming. In this paper, we mainly focus on the research of the similarity join between the Jaccard and Cosine.Existing similarity join algorithms fall into two categories: the threshold-based similarity join and the Top-k similarity join. The threshold-based similarity join is that it gives a threshold θ and returns pairs of string whose threshold is not less than θ; The Top-k similarity join is that its threshold is unknown and we need to find the most k’s similar pairs of string form the collection.Top-k similarity join is suitable for applications in which the threshold is unknown in advance. The most efficient Top-k similarity join algorithm is Topk-join, which is proposed by Xiao et al. But the problem of the algorithm is that: In order to reduce the redundant computation, every candidate record generated in the process of the algorithm needs to find hash table, which leads to more hash lookup cost; Each prefix event only deals with a Token of the record, making the temporary result set close to the true result is slower, and the maintenance cost of the prefix event and the temporary result heap is too large.In order to resolve the performance problems of Topk-join, a novel Top-k similarity join algorithm Opt-join is proposed in this paper. By integrating the token batch processing technique into the existing event-driven framework, Opt-join reduces the cost of processing the prefix events. By improving the performance of the maxdepth depth optimization algorithm and given the qualitative analysis of the selection of maxdepth. Also, Opt-join reduces the cost in hash lookup by switching the positions of the the hash lookup and filtering operations. The correctness of the new algorithm is proved.Experimental results show that 1.28x-3.09 x speed-up is achieved by Opt-join compared with Topk-join. More importantly, with the increase of the record length or the k value, Opt-join surpasses Topk-join by a larger margin.

Keywords/Search Tags:

PDF Full Text Request

Related items

1	The Optimization Of Hash Join Algorithm Based On KNL
2	Research On Hash Join Algorithm In DM Database
3	A Hash-based Routing Lookup Algorithm
4	Optimization Of Database Join Algorithms On DRAM/NVM-Based Hybrid Memory
5	An Example-Driven Predicate Extraction And Conflict Resolution System For Similarity Join
6	Hash Join Algorithm Design And Implementation Of A Multi-core Environment
7	Optimization of batch reactions using data-driven & knowledge-driven models: The case of asymmetric catalytic hydrogenation
8	Research On String Similarity Join Based On Filtering Verification Framework
9	Research And Realization Of Database Hash Join Algorithm On Multi-core CPUs
10	Research On Similarity Join Processing Based On Entity