Font Size: a A A

Research On Top-k String Similarity Query Algorithms

Posted on:2016-01-09Degree:MasterType:Thesis
Country:ChinaCandidate:Y J HanFull Text:PDF
GTID:2308330479951074Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
In this paper, we focus on computing top-k similar strings based on edit distance, i.e, given a query string and string set, finding k similar strings to query string based on edit distance from string set. It is an important operation in many applications, such as data integration, data cleaning, bioinformatics and pattern recognition.The main research is as follows:Firstly, we point out the repeated processing problems when adapting existing threshold-based string similarity query methods to top-k string similarity query. We then proposed strategy that filters irrelevant strings based on length difference between string lengths in ascending order based on length-skip index. We can terminate the processing of invert-lists in advance according to this strategy. Then proposed the lower bound of edit distance between query string and unmatched string set(no common signatures with query string). We proposed the basic top-k string similarity query algorithm based on above strategy and lower bound, called Topk Length algorithm.Secondly, for the problems of inefficiency when processing local invert-lists and unmatched string set with query string in Topk Length algorithm, we first propose novel adaptive count filter strategy based on current kth result, futher reduce the times of edit distance computation when process local invert-lists.Then according to the common signature lower bound of string edit distance within threshold under asymmetric signature scheme that we adopt, further reduce the times edit distance computation when processing the the unmatched string set with query string,The point of these two strategies is updating string similarity threshold based on current kth result dynamicly. Then present more efficient Topk Length Count algorithm.Finally, we perform environments and verify the performance of our algorithms in different real data sets. The experimental results show that Topk Legnth Count algorithm is outperformed Topk Length algorithm. Compared with state-of-the art top-k alogorithm, the results show the efficiency and scability of Topk Length Count algorithm,...
Keywords/Search Tags:top-k string similarity query, length-skp index, asymmetric signature scheme, adaptive count filter
PDF Full Text Request
Related items