Font Size: a A A

Approximate Pattern Matching With Gap Constraint Under The Edit Distance

Posted on:2015-08-20Degree:MasterType:Thesis
Country:ChinaCandidate:S RenFull Text:PDF
GTID:2298330452494223Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The problem of pattern matching is the oldest and most widely studied in computerscience, with the increasing scale of the text that people need to deal, and the search in textis become more complex, the approximate pattern matching that allowed some slightdifferences between text and mode has been widely used. But the research in connectionwith the problem of approximate pattern matching with gap constraint has not yet mature,so this thesis opens up the research and implementation contrary to this issue.The problem was divided into two parts, approximate string matching problem and thepattern matching with gap constraint. In aspect of approximate string matching, firstly thepaper introduces some basic string matching related knowledge, including the concept ofstring matching as well as some traditional string matching methods. Then studies themethods of approximate string matching, including dynamic programming, automata, bitparallel and text filtering. Through researching and comparing four methods, the paper willuse dynamic programming as the primary method for approximate string matching, and useedit distance model as the measure of string similarity model, programming of theestablishment of the editing distance matrix as well as the calculation of string edit distance.In aspect of pattern matching with gap constraints, the text string is saved as plain text, weindexed each string in the text that same as the way Lucene inverted indexed plain textdocuments. This index contains the string position and the number of occurrences, on thebasis of the index, and we used SpanNearQuery in Lucene to search the pattern. Becausewhen the text is too large, common ordered method will cost a lot of time, so we abandon it,and it brings us to a very high matching efficiency.Finally, we compare the paper’s algorithm and common order algorithm, under thepremise of guaranteed the same text, the same pattern and the same result, after a lot ofexperiments and analysis of the experimental results, we can see that when the text is largeand not change frequently, the algorithm in the paper consumes a reasonable space inexchange for the absolute reduction in time, it’s better than ordinary sequential algorithm.
Keywords/Search Tags:approximate match, gap constraint, edit distance, Lucene, invertedindex
PDF Full Text Request
Related items