Font Size: a A A

Research On The Algorithms For String Similarity Search

Posted on:2018-06-11Degree:MasterType:Thesis
Country:ChinaCandidate:H Z HuangFull Text:PDF
GTID:2348330515492884Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
String similarity search is widely used in real life,such as similar web page detection,data cleaning,recommendation function of electronic business website,protein function prediction and so on.We generally use a given similarity function to determine whether the two strings are similar or not.There are many similarity functions,such as hamming distance,edit distance,cosine metric,jaccard metric and so on.Among them,edit distance is the most widely used similarity function.At present,the study of string similarity search based on a similarity function can be divided into three categories,they are threshold-based string similarity search,top-k string similarity search and string similarity connection.However,with the changing needs of users,the string similarity search based on a similarity function has been unable to meet the requirements of users,so string search with wildcards is gradually being taken up by the researchers.In this paper,three types of string similarity search problems are considered,namely threshold-based string similarity search,top-k string similarity search and string search with wildcards.In terms of threshold-based string similarity problem,currently,the methods of processing it are mostly based on filter-refine framework.In this paper,a method called PBsearch algorithm is proposed,which is also based on this framework.We optimize this method during both the filter and the refine stage.In the filter step,we use length filtering,quantity filtering,location filtering and add the One-Off condition to filter a large number of invalid matches for the first time.In the refine step,we come up with a new method called MultiThreshold algorithm to decrease the number of calculation for the edit distance greatly.To deal with top-k string similarity search problem,we put forward two partition-based algorithms which are called Pb-topk algorithm and PbCount-topk algorithm.Pb-topk algorithm uses the incremental strategy of the difference to reduce the number of strings to be treated;PbCount-topk algorithm utilizes a strategy,which is called the partition of the matching number strategy,to further shrink the size of the candidate set.In the problem of string search with wildcards,we focus on the problem of string search with wildcards"*" and "?" in the query.Although the Patricia trie index structure has an efficient property of prefix sharing,it still needs to be improved in query efficiency.In this paper,we present a modified structure called w-trie.It is based on the structure of Patricia trie,but it improves the query efficiency of Patricia trie structure.This structure has a wildcard branch in the inner node,which can also enhance the search efficiency.Given a wildcard constraint k,each inner node of w-trie consumes a wildcard until the wildcard constraint k has been satisfied on the branch,from the branch's root node to its current node.Finally,a string search algorithm with wildcards is proposed based on this structure,and it is optimized by length filtering in the search process.Finally,experimental results on three real data sets verify the effectiveness of the proposed algorithm to deal with threshold-based string similarity search,top-k string similarity search and string search with wildcards problems.
Keywords/Search Tags:string similarity search, threshold, top-k, partition, edit distance, wildcard, Patricia trie index, w-trie index
PDF Full Text Request
Related items