Font Size: a A A

Approximate String Matching And Optimizition Techniques Using Edit Distance

Posted on:2013-09-26Degree:MasterType:Thesis
Country:ChinaCandidate:Y WangFull Text:PDF
GTID:2298330467972085Subject:Computer technology
Abstract/Summary:PDF Full Text Request
With the wide application of computer technology in various fields, the amount of information has also grown exponentially. For a variety of reasons, it is inevitable to have some errors in so much data. Among various types of data, text data is the most common and the oldest one. Approximate string matching, or mathcing on text data allowing certain errors, has been used in many fields, such as data cleaning, entity recognition, spelling check, collection joins and Web search. However, approximate string matching provides users with more powerful and more comprehensive functions on the one hand. On the other hand, it. makes the matching process and indexing more complex. Therefore, it is essential to make approximate strings matching efficiently.At present, there is a lot of work committed to solve the problems of approximate string matching. Most of the work is concerning the approximate string matching based on edit distance. Most edit distance based approximate string matching algorithm is based on signature and uses the index structure to support approximate string matching. During matching, algorithm adopts a two-stage method, which is filter stage and verification stage. The filter stage first obtains string collections whose element is similar with the query string from the index structure, and then filters out some string that cannot be the result string to generate the candidate set through a variety of filtering algorithms. Candidate set contains all the results as well as some strings that are not qualified, so it is necessary to verify each string in the candidate set to determine whether it is the result or not in the verification stage.Based on the existing work, this article combines the idea of variable-length gram algorithm and the idea of asymmetric fixed-length chunk-gram algorithm to provide a new algorithm, which is the asymmetric variable-length chunk-gram algorithm, and analyzes the lower bound of common signatures of this algorithm. Based on variable-length chunk-gram algorithm, the optimal variable-length chunk algorithm is presented. This algorithm divides the query string into optimal τ+1variable-length chunks. Based on BWT, this article also provides a new BWT_PA index to support the division of the optimal τ+1variable-length chunks, so that the algorithm meets the minimum number of signature, τ+1, based on the signature framework. Moreover, this article studies how to apply the signature-based approximate string matching technique to the top of the commercial database management system (DBMS) to support approximate string matching. Therefore, it is unnecessary to modify the bottom of the commercial database when using the features of commercial databases. Finally, a lot of experiments on real data sets are made. Experimental results and analysis of experimental results prove that the proposed algorithm is capable of solving approximate string matching problems efficiently.
Keywords/Search Tags:edit distance, approximate matching, DBMS, string, query optimization
PDF Full Text Request
Related items