Font Size: a A A

Research On Multi-pattern Matching Algorithm For Chinese Based On Finite State Automata

Posted on:2014-09-26Degree:MasterType:Thesis
Country:ChinaCandidate:B YangFull Text:PDF
GTID:2268330401488761Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Pattern matching is an important research direction of the computer applicationfield and is widely used in the Intrusion Detection, the Information Retrieval andthe Biomedical Sciences. With the development of the Internet, the information ofnetworks increases sharply. How to improve the space and time performances ofpattern matching algorithm have become research focuses of the InformationRetrieval and Content Filtration.Several important single-pattern and multi-pattern matching algorithms aredescribed in this thesis, such as BF, BM, QS, AC, AC_BM and WM. The timeperformances of these algorithms are analyzed, their advantages and disadvantagesare discussed.With the increasing in the number of Chinese pattern string, multi-patternalgorithms based on the finite state automata (FSA) require a lot of storage spacewhich causes much time to be needed to calculate the goto function and search forthe jump-distance table, so the space and time performances decline sharply. Forthis problem, a new storage structure based on adjacency-list with jump-distance isproposed. The storage structure establishes a linked list for each state, and all linkedlists constitute an adjacency-list which will blend with the jump-distance table, sothe state and the jump-distance can be inquired simultaneously, which improves thealgorithm’s time and space performance effectively. Based on the adjacency-listwith jump-distance, a multi-pattern matching algorithm for Chinese named AC_SC(AC Suitable for Chinese) is proposed, time and space performances of thisalgorithm are analyzed.Performances of different storage structures of the FSA and time performancesof different jump-matching algorithms are tested. Experimental results show that thespace performance of the adjacency-list with jump-distance is better than that of thestate matrix and the complete Hash table, the time performance is better than that ofthe state matrix and a little worse than that of the complete Hash table. The averagejump-distance and the time performance of AC_SC algorithm are better than that ofother jump-matching algorithms such as AC_Tuned BM, AC_WM and IACBM.
Keywords/Search Tags:Multi-pattern Matching, FSA, Adjacency-list, AC_SC algorithm
PDF Full Text Request
Related items