Font Size: a A A

Research On Top-k String Similarity Search Based On Edit Distance

Posted on:2016-08-05Degree:MasterType:Thesis
Country:ChinaCandidate:J YangFull Text:PDF
GTID:2308330479990104Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
String similarity search has a wide variety of applications such as in data cleaning, data integration, error checking, plagiarism detection, biological sequence analysis and so on. So far, a variety of similarity functions have been proposed to measure the similarity between two strings. While it is widely recognized that there does not exist a similarity function that provides the best quality in all application domains, edit distance remains one of the widely accepted similarity measures. Therefore, we employ edit distance to measure the similarity between two strings. In general, two types of queries are often considered range search and top-k similarity search. In this paper we force on top-k similarity search problem. Given a query string q, top-k similarity search returns k strings in S with the smallest edit distance to q.Nowadays, there are three kinds of algorithms to solve the string top-k similarity search problem. AQ and Appgram are based on n-gram model, but the two methods ignore the position information of the n-gram. AQ algorithm needs establish multiple indexes according to different length of n-gram, thus it leads to low efficiency. Appgram is not an exact query algorithm, it tends to lost some results. Bedtree is based on B+-tree structure, due to the loose filtering condition, it has to traverse through multiple nodes. Range algorithm and Hstopk algorithm are very effective for short strings, when dealing with long string collection, in order to improve the query performance, the two algorithms need a lot of extra memory space.Aiming at the defects of the present method,In this paper, based on the n-gram model, combined with the n-gram position information, we design a new inverted index structure, combine with this new index structure we designe two search strategies: sequential search algorithm and cycle search algorithm. In the filtering stage, combined with the length filter, counting filter and postion filter, In addition, we adopt frequency filter. Experiments show that the frequency filtering can further filter out a large number of candidate strings. At the same time, we study the initialization problem in top-k similarity search. Combined with statistical information in the whole string collection, we classify the string collection and use the strings which are in the same class with the query string to initialize. Finally we design disk index construction and query method. Finally we conduct multiple sets of comparative experiments in three real data sets to demonstrate the proposed algorithm in this paper is efficient. Firstly we test the memory consumption of various algorithms in the three real data sets. Then we compare the query performance of various algorithms. At the same time, we test the two kinds of index construction methods, two kinds of search strategies, four kinds of heap initialization algorithm. Then we test the construction time and query performance based on disk index. Finally we has carried on the test of the length filtering strategy and frequency filtering strategy. Through experiments it proves the high efficiency of the algorithm proposed in this paper.
Keywords/Search Tags:top-k similarity search, edit distance, inverted index, frequency filtering
PDF Full Text Request
Related items