Font Size: a A A

Research On String Edit Similarity Join

Posted on:2013-03-01Degree:MasterType:Thesis
Country:ChinaCandidate:H CaoFull Text:PDF
GTID:2268330392467999Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
String similarity join has become an essential operator in many applications wherenear-duplicate objects need to be found, such as coalition detection, fuzzy keyword match-ing, data integration, data cleaning. This paper focuses on string similarity join with editdistance constraints which measures the similarity of two strings by the minimum numberof edit operators (insertion, deletion, and substitution of single characters) to transformone string to the other.This paper takes the frequencies of single characters as well as other statistics asglobal information of strings. Specifically, a novel partition-based algorithm is develope-d to utilize such information to enumerate smaller candidate set in a more efcient wayby partitioning dataset into small chunks. Some new filter is proposed to leverage thefrequencies of single characters. It has a low complexity and can prune away many can-didate pairs which pass through the existing filters. We experimentally verify the superiorefciency of our algorithm to alternative methods, using real datasets.Based on disk algorithm, we also implement a disk-algorithm framework. The diskscheduling problem is proved to be NP-complete, and several heuristics are proposedto solve this problem. The incremental computation, via the proposed disk algorithmframework, is also discussed. Experiments verified the key idea in this paper, which laidthe foundation for future disk algorithms.
Keywords/Search Tags:string similarity join, edit distance, frequent vector, data partition
PDF Full Text Request
Related items