Font Size: a A A

Research On Filter Algorithms For Approximate String Matching

Posted on:2010-02-11Degree:MasterType:Thesis
Country:ChinaCandidate:D C SunFull Text:PDF
GTID:2178330338982384Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Approximate string matching was a basic issue in computer science. It was widely used in various fields such as Information Retrieve, Computational Biology, Pattern Recognition, et al. Researching on approximate string matching for high speed, high accuracy and low consumption will be a push to these areas. Q-gram index has the merit of language independence and garble tolerant, and it was often used to processing Chinese. Filter algorithm could throw off a lot of text by filter criterion, so it was often used for matching in big text. Q-gram was widely used for its simple and high speed, and it was always used with filter algorithm together. To matching in Chinese dataset and improve matching speed, the key issues of approximate string matching such as Chinese index structure, optimize index and extract new Match-Region features were discussed in this thesis.In this thesis, an index for Chinese text was presented. First, this method mapped the Chinese word of GB2312 Charset into integers by Hash function. Second, a two-level index structure was used to store Bigram terms.To speed up index creation and reduce the consumption of memory q-gram index was optimized. First, linked list memory management scheme was used on creating of q-gram index in order to utilize memory effective. Second, compression could reduce both the size of indexes and the time needed to evaluate queries. The optimal compression method and parameters for Chinese index were found by experiment with various compression methods.The last aim is to speed up approximate string matching by improving filtration efficiency. A novel filter algorithm based on Match-Region features was presented. First, both pattern and text were logically divided into blocks of fixed size, and then new Match-Region features were extracted from blocks, and the algorithm optimizes the fundamental q-gram filtration criterion by the new features. Last, the improved method of choosing Filtration-Region based on blocks was used for filtration. The experimental results demonstrate that the algorithm achieves higher matching speed than the original algorithm under low error rate.
Keywords/Search Tags:Approximate String Matching, Filter Algorithm, Q-gram index, Index Compression, Match-Region Feature, Filtration Criterion
PDF Full Text Request
Related items