Font Size: a A A

An Algorithm Of Computing String Similarity Based On Improved Levenshtein Distance

Posted on:2014-07-11Degree:MasterType:Thesis
Country:ChinaCandidate:A Q HanFull Text:PDF
GTID:2268330401482012Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Fuzzy query technology to solve the network information retrieval difficulty forusers, its core idea is the edit distance algorithm.Edit distance is published in1966byA.Levenshten, it is used to calculate the similarity of two strings or texts.Edit distancecan be regards as the minimum cost of transforming one string into another througha sequence of weighted edit operations.It has been widely applied in the field of fuzzyquery.When calculating the similarity of strings, the Levenshtein Distance algorithmonly considered the operating times and ignores the common substrings of twostrings;In the solusion of backtracking path,it can not be completely matches thepossibility of all string change steps.Aiming these problems, an improved Levenshteindistance algorithm is proposed to calculate the similarity. The new algorithmimproves the formula of similarity and the Levenshtein matrix. When calculating thedistance, the new algorithm calculates the longest common substring all the LDbacktracking paths in the original matrix at the same time. From the linear spaceNAKATSU algorithm for solving all path backtracking ideas, definitions for new LDmatrix all path backtracking algorithm. The experiment adopts intelligent ABC as adictionary. Selecting a word in the experiment as a source string, a set of similarwords of the different degrees of the source string as a target string, the new similaritymeasure formula is compared with the existing string similarity calculation method.Based on the idea of dynamic programming, in the calculation of matrix in each row,each row will be the minimum and the definition of search intensity value iscompared, if not meet certain conditions ahead of time out of the matrix, in certaincases, the demand is necessary.Comparison of new formula and retrieval tools Lucene toolkit FuzzyQuery class,the new formula reduce the number of target strings into the winner table withsimilarity sample range and standard deviation of0.331and0.150, respectively. Theexperiments results show that the new algorithm has higher accuracy and moreflexible searching way in the same space complexity.
Keywords/Search Tags:Edit Distance, LD algorithm, Backtracking path, Longest commonsub-string, Similarity, Fuzzy Query
PDF Full Text Request
Related items