Font Size: a A A

An Algorithm For Matching Strings With Wildcards

Posted on:2012-09-05Degree:MasterType:Thesis
Country:ChinaCandidate:Z J YunFull Text:PDF
GTID:2268330425491585Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The problem of string matching with wildcards has become a research hotspot in many areas, such as bioinformatics analysis, SQL queries, text indexing of search engine, and entrust intrusion detection. Although query with wildcard can better meet the needs of the users, it makes the query processing more complex. Consequently, how to support string matching with wildcards efficiently poses great challenge to the existing techniques.A lot of methods have been proposed to solve this problem, most of which are query-based online algorithms. Only a few of them are index-based offline algorithms, but the great space consumption and strict wildcard constraints greatly limit the application of these offline algorithms. In this thesis, we focus on the problem of string matching with wildcards "*" and "?" in the query including exact string matching and approximate string matching, where "*" matches any sequence of characters and "?" matches any character in alphabet. Since gram-based index structure has advantages in both space and searching time, this thesis proposes an algorithm based on gram index structure.To begin with, we summarize the existent string matching algorithms. Based on the analysis of these algorithms, we propose an exact string matching algorithm based on q-gram index. In this algorithm, we divide the query string with wildcards into several query segments without any wildcard. Thus this problem is successfully transferred into a list of simple exact substring matching problems, In addition, we accelerate the query process with the help of length filter, position filter and count filter. Then we extend this exact string matching algorithm to support approximate string matching. One approach is proposed based on the results of the query segments taking advantage of Pigeonhole Principle which is often used as a filter method in approximate string matching problem. Part of the strings can be filtered if they are not the results of the query. However, this approach can only be used when edit distance threshold is very small and the filter capacity isn’t very good because of the existence of wildcards. Another approach is proposed based on gram index, which can speed up the query processing by filtering the impossible query answers according to the lower bound of the number of common grams between the query and the matching strings. Finally, the experimental results on real data sets show that the proposed algorithm of gram-based string matching with wildcards can not only reduce the cost of space but also has good filter ability and high searching performance.
Keywords/Search Tags:wildcard, q-gram index, suffix tree index, filter, string matching
PDF Full Text Request
Related items