Font Size: a A A

Approximate Longest Common Substring Matching And Optimization Techniques With Edit Distance Constraint

Posted on:2015-03-06Degree:MasterType:Thesis
Country:ChinaCandidate:X YeFull Text:PDF
GTID:2308330482454491Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Nowadays, numerous data is preserved in computers in the form of text. The information retrieval about text data, for example the longest common substring which is a classical problem of text management and program analysis, has attracted lots of attention and been widely researched for a long time.However, the longest common substring must be composed of the continuous exact matches. Two strings which are similar in local have common segments. In practice, the common segments are not continuous exact matches. So a new method to solve approximate longest common substring matching with edit distance constraint need be proposed. But the existed algorithms can not apply to the situation. Therefore, the thesis focuses on the problem of approximate longest common substring matching with edit distance constraint.At first, the thesis reviews some algorithms on approximate alignment. Based on the previous studies, a dynamic programming-based algorithm is designed to answer the approximate longest common substring with edit distance constraint in this thesis. For two strings, the alignment between two strings is to find the approximate longest common substring under edit distance constraint in the algorithm, using dynamic programming of edit distance and the longest subsequence. However, the algorithm need O(kmn) time. Therefore, another query algorithm based on common substring is proposed in this thesis. Firstly, common substrings of two strings are obtained by suffix array. Then all the connections of common substrings are enumerated to construct the answers by properties of the approximate longest common substring with edit distance constraint. The filtering techniques based on the position and connection distance of common substrings are adopted in the algorithm to improve the performance.At last, the two different algotithms are tested in three real datasets. The results of the experiments on the real datasets demonstrate that both the algorithm can answer all approximate longest common substrings with edit distance constraint and the performance of the algorithm based on common substrings is better than the one of the dynamic programming-based algorithm.
Keywords/Search Tags:approximate alignment, edit distance, dynamic programming, suffix array, longest common substring
PDF Full Text Request
Related items