Research On String Edit Similarity Join

Posted on:2013-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:H Cao

Full Text:PDF

GTID:2268330392467999

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

String similarity join has become an essential operator in many applications wherenear-duplicate objects need to be found, such as coalition detection, fuzzy keyword match-ing, data integration, data cleaning. This paper focuses on string similarity join with editdistance constraints which measures the similarity of two strings by the minimum numberof edit operators (insertion, deletion, and substitution of single characters) to transformone string to the other.This paper takes the frequencies of single characters as well as other statistics asglobal information of strings. Specifically, a novel partition-based algorithm is develope-d to utilize such information to enumerate smaller candidate set in a more efcient wayby partitioning dataset into small chunks. Some new filter is proposed to leverage thefrequencies of single characters. It has a low complexity and can prune away many can-didate pairs which pass through the existing filters. We experimentally verify the superiorefciency of our algorithm to alternative methods, using real datasets.Based on disk algorithm, we also implement a disk-algorithm framework. The diskscheduling problem is proved to be NP-complete, and several heuristics are proposedto solve this problem. The incremental computation, via the proposed disk algorithmframework, is also discussed. Experiments verified the key idea in this paper, which laidthe foundation for future disk algorithms.

Keywords/Search Tags:

PDF Full Text Request

Related items

1	Research And Application On Distributed Parallel String Similarity Join
2	An Improved String Similarity Join Algorithm
3	Research On Similarity Query Over Sequence Data
4	Sequentially Matching Similarity String Algorithm Research
5	Research On The Algorithms For String Similarity Search
6	Efficient Approximate String Matching With Edit Distance Constraint
7	Research On String Similarity Search Algorithms
8	An Algorithm Of Computing String Similarity Based On Improved Levenshtein Distance
9	Research On Top-k String Similarity Search Based On Edit Distance
10	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques