Font Size: a A A

Research On String Similarity Join Based On Filtering Verification Framework

Posted on:2021-03-28Degree:MasterType:Thesis
Country:ChinaCandidate:R ChenFull Text:PDF
GTID:2428330647961941Subject:Computer technology
Abstract/Summary:PDF Full Text Request
As the basic operation of data processing,similarity join is widely used in the fields of data cleaning,similarity web search,bio-informatics and so on.However,the continuous growth and change of data size and data type bring challenges to similarity join.Although some existing string similarity join algorithms based on filtering verification framework can improve the efficiency of similar join,there are still some problems such as insufficient filtering ability and not well adapted to the string sets with large differences in length.Therefore,in order to meet the requirements of high efficiency and high precision of similarity join algorithm,this paper mainly studied the string similarity join algorithm based on filtering verification framework.The main work and contributions are as follows:In view of the fact that the filtering capability of the existing algorithm is gradually unable to meet the needs of large data sets,a double filtering string similarity join algorithm based on partition(DFP-ssj)is designed.The algorithm first divides the string set into several subsets.Then the length filtering method is used to reduce the time consumption of the algorithm in the process of similar join between and within each subset.In order to further improve the filtering ability,the prefix filtering algorithm is used in the algorithm to delete the initial candidate pair to get the final candidate pair,and then the final candidate pair is verified based on the edit distance.Finally,according to the analysis of experimental results,the DFP-ssj algorithm is improved in filtering capacity and similar join efficiency.Aiming at the problem that there are a large number of false positive examples in the results of similarity join of long string sets such as gene sequences,this paper designs a string similarity join method based on Locality Sensitive Hashing(LSH).This algorithm is implemented based on the filtering verification framework.Considering the complexity of calculating the edit distance between long strings,this method uses the embedding method to embed all strings in the edit space into hamming space.Then,LSH is used to map the data in hamming space.In this process,length filtering and k-value-based filtering methods are used to delete the false positive examples.Through the experimental analysis,the LSHbased similarity join method can effectively reduce the number of false positive examples,and has a better similarity join efficiency.
Keywords/Search Tags:Filtering verification framework, Similarity join, Partition, LSH
PDF Full Text Request
Related items