Research On String Similarity Join Based On Filtering Verification Framework

Posted on:2021-03-28

Degree:Master

Type:Thesis

Country:China

Candidate:R Chen

Full Text:PDF

GTID:2428330647961941

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

As the basic operation of data processing,similarity join is widely used in the fields of data cleaning,similarity web search,bio-informatics and so on.However,the continuous growth and change of data size and data type bring challenges to similarity join.Although some existing string similarity join algorithms based on filtering verification framework can improve the efficiency of similar join,there are still some problems such as insufficient filtering ability and not well adapted to the string sets with large differences in length.Therefore,in order to meet the requirements of high efficiency and high precision of similarity join algorithm,this paper mainly studied the string similarity join algorithm based on filtering verification framework.The main work and contributions are as follows:In view of the fact that the filtering capability of the existing algorithm is gradually unable to meet the needs of large data sets,a double filtering string similarity join algorithm based on partition(DFP-ssj)is designed.The algorithm first divides the string set into several subsets.Then the length filtering method is used to reduce the time consumption of the algorithm in the process of similar join between and within each subset.In order to further improve the filtering ability,the prefix filtering algorithm is used in the algorithm to delete the initial candidate pair to get the final candidate pair,and then the final candidate pair is verified based on the edit distance.Finally,according to the analysis of experimental results,the DFP-ssj algorithm is improved in filtering capacity and similar join efficiency.Aiming at the problem that there are a large number of false positive examples in the results of similarity join of long string sets such as gene sequences,this paper designs a string similarity join method based on Locality Sensitive Hashing(LSH).This algorithm is implemented based on the filtering verification framework.Considering the complexity of calculating the edit distance between long strings,this method uses the embedding method to embed all strings in the edit space into hamming space.Then,LSH is used to map the data in hamming space.In this process,length filtering and k-value-based filtering methods are used to delete the false positive examples.Through the experimental analysis,the LSHbased similarity join method can effectively reduce the number of false positive examples,and has a better similarity join efficiency.

Keywords/Search Tags:

Filtering verification framework, Similarity join, Partition, LSH

PDF Full Text Request

Related items

1	A Partition-based Bi-Directional Filtering-Verification Method For String Similarity Joins
2	Research And Implementation Of Similarity Join For Big Data
3	Optimizing Top-k String Similarity Join Algorithm
4	Research On String Edit Similarity Join
5	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
6	Research And Implementation Of High Efficiency Set Similarity Join Algorithm Based On Overlap Similarity
7	DV-Join:A Novel Method Based On Dynamic Indexing And VMC-Filtering For Similarity Join
8	Research On Improvement Of Similarity Join In MapReduce
9	Research On Self-Similarity Join In Heterogeneous Networks
10	Aggregate Queries On Constrained Probabilistic Similarity Join Pairs