Font Size: a A A

Sequentially Matching Similarity String Algorithm Research

Posted on:2017-01-20Degree:MasterType:Thesis
Country:ChinaCandidate:R L ZhangFull Text:PDF
GTID:2308330503957632Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Approximate string matching is a basis problem in the computer science, which was been applied widely in the field of text retrieval, bioinformatics, signal processing, intrusion detection, pattern recognition, data mining and entity recognition etc. The matching efficiency determines the efficiency of these applications. The traditional dynamic programming algorithm is inefficient, and the automation algorithm is complex, so the filtering-verification algorithm became the most popular method. In order to improve the efficiency, we proposed a sequential detection method. The contents of this paper are as follows.1. We proposed a sequential detection method to compute edit distance, which compares the strings in order. When the local position is mismatch, we choose a reasonable edit basic operation. This algorithm’s time complexity is O(n), while the dynamic programming is O(n2). The key is to choose the basic operation when the local position is mismatch. This paper proposed two methods. The first method is based on the basic operation sequence. First, we enumerate the basic operation sequence of a certain edit distance. According to the basic operation sequence, we choose the basic operation of local mismatch positon. This algorithm applies when the edit distance is less than or equal to 3. The second method is based on the local optimal rules. We determine the edit basic operation according to the number of handled characters and length difference of the rest strings when add a basic operation. This method gets an estimated upper bound of the edit distance, which can verify and filter the similarity string.2. We proposed a threshold k approximate string matching algorithm, base on the sequential detection method and the filter-verfication framework. In the filter step, use the offline count filter first. Then, use the local optimal rules sequential detection estimate the edit distance, Filter the candidates according the estimated value. In the verification step, use the basic operation sequence verify the candidates when k is less or eqeal to 3, otherwise, use the dynamic programing algorithm verify the candidates. The experiments show that compared to the Merge Filter algorithm, this method’s time efficiency improved at least 17% in DBLP, IMDB, WebCorpus dataset.
Keywords/Search Tags:approximate string match, edit distance, sequential detection, edit operation sequence, local optimal rules
PDF Full Text Request
Related items