An Algorithm Of Computing String Similarity Based On Improved Levenshtein Distance

Posted on:2014-07-11

Degree:Master

Type:Thesis

Country:China

Candidate:A Q Han

Full Text:PDF

GTID:2268330401482012

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

Fuzzy query technology to solve the network information retrieval difficulty forusers, its core idea is the edit distance algorithm.Edit distance is published in1966byA.Levenshten, it is used to calculate the similarity of two strings or texts.Edit distancecan be regards as the minimum cost of transforming one string into another througha sequence of weighted edit operations.It has been widely applied in the field of fuzzyquery.When calculating the similarity of strings, the Levenshtein Distance algorithmonly considered the operating times and ignores the common substrings of twostrings;In the solusion of backtracking path,it can not be completely matches thepossibility of all string change steps.Aiming these problems, an improved Levenshteindistance algorithm is proposed to calculate the similarity. The new algorithmimproves the formula of similarity and the Levenshtein matrix. When calculating thedistance, the new algorithm calculates the longest common substring all the LDbacktracking paths in the original matrix at the same time. From the linear spaceNAKATSU algorithm for solving all path backtracking ideas, definitions for new LDmatrix all path backtracking algorithm. The experiment adopts intelligent ABC as adictionary. Selecting a word in the experiment as a source string, a set of similarwords of the different degrees of the source string as a target string, the new similaritymeasure formula is compared with the existing string similarity calculation method.Based on the idea of dynamic programming, in the calculation of matrix in each row,each row will be the minimum and the definition of search intensity value iscompared, if not meet certain conditions ahead of time out of the matrix, in certaincases, the demand is necessary.Comparison of new formula and retrieval tools Lucene toolkit FuzzyQuery class,the new formula reduce the number of target strings into the winner table withsimilarity sample range and standard deviation of0.331and0.150, respectively. Theexperiments results show that the new algorithm has higher accuracy and moreflexible searching way in the same space complexity.

Keywords/Search Tags:

Edit Distance, LD algorithm, Backtracking path, Longest commonsub-string, Similarity, Fuzzy Query

PDF Full Text Request

Related items

1	Research On Similarity Query Over Sequence Data
2	Sequentially Matching Similarity String Algorithm Research
3	An Improved String Similarity Join Algorithm
4	Research On String Edit Similarity Join
5	Efficient Approximate String Matching With Edit Distance Constraint
6	Research On Top-k String Similarity Search Based On Edit Distance
7	Approximate String Matching And Optimizition Techniques Using Edit Distance
8	On The Irregular Planar Fragments Matching
9	Study On Fast Algorithms For Edit Distance
10	The String Pattern Matching Algorithm Based On Edit Distance