Research On Filter Algorithms For Approximate String Matching

Posted on:2010-02-11

Degree:Master

Type:Thesis

Country:China

Candidate:D C Sun

Full Text:PDF

GTID:2178330338982384

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Approximate string matching was a basic issue in computer science. It was widely used in various fields such as Information Retrieve, Computational Biology, Pattern Recognition, et al. Researching on approximate string matching for high speed, high accuracy and low consumption will be a push to these areas. Q-gram index has the merit of language independence and garble tolerant, and it was often used to processing Chinese. Filter algorithm could throw off a lot of text by filter criterion, so it was often used for matching in big text. Q-gram was widely used for its simple and high speed, and it was always used with filter algorithm together. To matching in Chinese dataset and improve matching speed, the key issues of approximate string matching such as Chinese index structure, optimize index and extract new Match-Region features were discussed in this thesis.In this thesis, an index for Chinese text was presented. First, this method mapped the Chinese word of GB2312 Charset into integers by Hash function. Second, a two-level index structure was used to store Bigram terms.To speed up index creation and reduce the consumption of memory q-gram index was optimized. First, linked list memory management scheme was used on creating of q-gram index in order to utilize memory effective. Second, compression could reduce both the size of indexes and the time needed to evaluate queries. The optimal compression method and parameters for Chinese index were found by experiment with various compression methods.The last aim is to speed up approximate string matching by improving filtration efficiency. A novel filter algorithm based on Match-Region features was presented. First, both pattern and text were logically divided into blocks of fixed size, and then new Match-Region features were extracted from blocks, and the algorithm optimizes the fundamental q-gram filtration criterion by the new features. Last, the improved method of choosing Filtration-Region based on blocks was used for filtration. The experimental results demonstrate that the algorithm achieves higher matching speed than the original algorithm under low error rate.

Keywords/Search Tags:

Approximate String Matching, Filter Algorithm, Q-gram index, Index Compression, Match-Region Feature, Filtration Criterion

PDF Full Text Request

Related items

1	Research On Q-gram Filters For Approximate String Matching
2	An Algorithm For Matching Strings With Wildcards
3	Approximate Text Alignment Techniques And Optimization Approaches
4	Research On Approximate String Matching Techniques Based On MC Index Structure
5	Research On Index Technology Of String Approximate Query
6	Research On Approximate String Matching And Its Application On URL Detection
7	Approximate String Matching And Optimizition Techniques Using Reversed Filter
8	Approximate Chinese String Matching Techniques Based On Pinyin Input Method
9	An Improved String Similarity Join Algorithm
10	Research And Implementation Of Approximate String Matching Algorithm Supporting Swapping