Research On String Similarity Join Algorithm

Posted on:2014-10-11

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

Full Text:PDF

GTID:2268330392464453

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

A string similarity join finds similar pairs between collections of strings. It is anessential operation in many applications, such as data integration and cleaning andcollaborative filtering, and it had attracted significant attention recently. The paper focuseson the research of finding similar pairs between two different collections of strings whichjoins with the edit distance, the main research is as follow:Firstly, existing trie_based string similarity join research methods focus on improvingthe string similarity join calculation of the same collections of strings. But, how to quicklycompute the string similarity join of the different collections of strings is not an easy job.For this problem, in this paper we analyze and summarize existing algorithms, and findthe inefficient reason.Secondly, for the problem of inefficiency in the existing methods, We propose twonovel algorithms, named Trie-TS and Trie-TSS, which use the symmetry of edit distanceto reduce redundant computation. Moreover, we propose a new optimization technique tofurther reduce the unnecessary computation to improve the overall performance.Finally, we development environment, and verify the efficiency of our algorithms inthe different data sources. The experimental results show that Trie-TSS algorithm is betterthan existing trie_based algorithms.When the number of the string collection is big enough,Trie-TS algorithm is also better. And we implement the new optimization techniquetowards the number of computation and query time to further reduce the unnecessarycomputation to improve the overall performance.

Keywords/Search Tags:

trie, the string similarity join, Trie-TS, Trie-TSS, the optimization technique

PDF Full Text Request

Related items

1	Research On The Algorithms For String Similarity Search
2	Research On String Retrieval Algorithm Based On Trie Tree
3	The Study On High-speed Algorithms Of Packet Classification And Its Implement Techniques
4	The Research Of The Comprehensive And Efficient IP Lookup Algorithm Based On Trie
5	Design And Implementation Of A Pipelined IP Lookup Structure Based On Trie
6	Research And Application Of String Approximate Matching Algorithm Based On Multivariate Information
7	Research On Efficient Index Structure And Parallelization Based On Double Array Trie
8	The Research Of The IP Routing Lookup Scheme Based On Trie
9	Research Of Personalized Search Based On Trie Tree
10	Faster Rebuilding Of Multi-bit Trie For Virtual Routers