Font Size: a A A

Research On Multi-pattern Matching Algorithm For Chinese Based On The Characteristics Of Chinese Character Coding

Posted on:2016-11-08Degree:MasterType:Thesis
Country:ChinaCandidate:Y HuangFull Text:PDF
GTID:2308330473957028Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of network and the information explosion in recent years, pattern matching technology is the core technology of content filtering and information retrieval, which becomes an important research direction in the field of Computer applications and information security. For the large-scale Chinese pattern matching problem, existing algorithms of time and space performance is hard to meet the demand.Therefore, It has important significance to improving the performance of computer application system and the time efficiency of Chinese pattern matching algorithm.This dissertation introduces several classic pattern matching algorithms, including BF-algorithn、BM-algorithm and Sunday algorithm,etc. researchcing AC、AC_BM and WM multiple pattern matching algorithm,etc in-depth. Meanwhile, describing the matching process of the algorithms in detail as well as analyzing their advantages and disadvantages and the time performance.Towards the large-scale matching in Chinese pattern string, Due to the relatively high divergence of Chinese characters, finite state automaton has long zero state, it cost more time finding character zero state, even the efficiency of algorithms sharp declined. Aiming at this problem, this dissertation design a kind of storage structure and corresponding multiple pattern matching algorithm based on Chinese character coding characteristics. The algorithm has the following features.(1) Considering the first byte range is smaller than trail byte range,in the process of matching, looking up the first byte at first, then trail byte, reducing the search time,skipping directly if it appears failure so as to reduce the search time.(2) Setting the matching tag to zero state characters, when matching characters tagged as 0,no longer matches the following character of the pattern, avoiding matching repeatedly and matching partially to save the matching time.(3) Storeing the jump distance in the state list and matching barrel to avoid looking up jump distance repeatedly to improve the matching efficiency of the algorithm.This dissertation have tested the time performance of different storage methods and different matching algorithm on the finite state automata.It shows that the time performance of this dissertation storage structure is better than the storage mode of state matrix and adjacency list. The time performance of algorithm in this dissertation is better than the AC_Tuned BM-algorithm, the AC_WM-algorithm and the IACBM-algorithm, so it is suitable for the large-scale Chinese pattern matching.
Keywords/Search Tags:Multiple pattern matching, Finite state automata, Chinese character coding features, tag
PDF Full Text Request
Related items