Font Size: a A A

Research On String Similarity Join Algorithm

Posted on:2014-10-11Degree:MasterType:Thesis
Country:ChinaCandidate:Y WangFull Text:PDF
GTID:2268330392464453Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
A string similarity join finds similar pairs between collections of strings. It is anessential operation in many applications, such as data integration and cleaning andcollaborative filtering, and it had attracted significant attention recently. The paper focuseson the research of finding similar pairs between two different collections of strings whichjoins with the edit distance, the main research is as follow:Firstly, existing trie_based string similarity join research methods focus on improvingthe string similarity join calculation of the same collections of strings. But, how to quicklycompute the string similarity join of the different collections of strings is not an easy job.For this problem, in this paper we analyze and summarize existing algorithms, and findthe inefficient reason.Secondly, for the problem of inefficiency in the existing methods, We propose twonovel algorithms, named Trie-TS and Trie-TSS, which use the symmetry of edit distanceto reduce redundant computation. Moreover, we propose a new optimization technique tofurther reduce the unnecessary computation to improve the overall performance.Finally, we development environment, and verify the efficiency of our algorithms inthe different data sources. The experimental results show that Trie-TSS algorithm is betterthan existing trie_based algorithms.When the number of the string collection is big enough,Trie-TS algorithm is also better. And we implement the new optimization techniquetowards the number of computation and query time to further reduce the unnecessarycomputation to improve the overall performance.
Keywords/Search Tags:trie, the string similarity join, Trie-TS, Trie-TSS, the optimization technique
PDF Full Text Request
Related items