Font Size: a A A

A Partition-based Bi-Directional Filtering-Verification Method For String Similarity Joins

Posted on:2017-02-08Degree:MasterType:Thesis
Country:ChinaCandidate:Y HuangFull Text:PDF
GTID:2308330503457670Subject:Software engineering
Abstract/Summary:PDF Full Text Request
A string similarity join finds similar string pairs from two sets of strings, with the similarity value of the pairs satisfies a given threshold, which can be quantified by a specified similarity metric. It is an important operation in many real-world applications, such as duplicate detection, entity resolution, collaborative filtering, data integration and cleaning. Various algorithms have been proposed to address its efficiency issues.Partition-based filter-verification methods, such as PassJoin, which can be applied to both long and short strings, are promising. This method quickly screens out possible similar string pairs(candidate set) by searching partitioned parts of a string in another string, in order of increasing length, and then performs similarity verification based on edit-distance.Our preliminary study shows that the candidate set produced by filtering in the descending order of string length is different from that in the ascending order, and the filtering ability in the descending order is better than that in the ascending order. Filtering in both directions, therefore, can produce a smaller minimum candidate set. To make full use of the advantages of bi-directional filtering, we propose a novel bi-directional filtering-verification mechanism. At the filtering stage, it pipelines the results from length-descending filtering to length-ascending filtering to further reduce the size of the candidate set and minimize the cost of bi-directional filtering, while selecting substring by using Multi-match-aware substring selection method based on bi-directional filtering. At the verification stage, it makes use of the two pairs of matched substrings from the bi-directional filtering to partition the target string pairs into five short substring pairs, which are verified respectively to accelerate the verification process. The experimental results show that both the pipelined bi-directional filtering and the verification method based on bi-directional filtering can improve the efficiency of the algorithm, and the proposed bi-directional filtering-verification algorithm outperforms the origin algorithm on real-world datasets.
Keywords/Search Tags:String Similarity Joins, Bi-Directional Filtering-Verification Mechanism, Filter-Verification Framework, Partition-Based Filtering Method
PDF Full Text Request
Related items