Font Size: a A A

The Research Of Fast Multi-Pattern Matching Algorithms

Posted on:2018-12-08Degree:MasterType:Thesis
Country:ChinaCandidate:H C WangFull Text:PDF
GTID:2348330512976908Subject:Circuits and Systems
Abstract/Summary:PDF Full Text Request
Pattern matching can be understood as the problem of getting the position of one or more patterns with certain property in a given sequence of symbols.Pattern matching is a fundamental problem in computer science.It has been widely studied,and widely used in information retrieval,intrusion detection,IP routing,data mining,text compression,optical character recognition,semantic analysis and many other areas involving symbol and character processing.In recent years,the explosive growth of Internet traffic,the increasingly serious network security situation,the rapid development of DNA sequence alignment technology,urgently need higher performance pattern matching algorithms.This paper first introduces classic single-pattern matching algorithms and Aho-Corasick algorithm,AC-BM algorithm,Wu-Manber algorithm,which are the most widely used of multipattern matching algorithms,and analyzes the advantages and disadvantages of these algorithms.Then through the experiment testing,compares with the performance of these algorithms,analyzes the influence of the length and number of pattern strings on the temporal and spatial performance of algorithms.Aho-Corasick algorithm is a classic multi-pattern matching algorithm with good matching speed,having been used in many scenarios.The drawback of this algorithm is that the space of the AhoCorasick automaton increases sharply as the size of the set of pattern strings becomes larger.Aiming at the problems of large space occupation and inefficiency in storage of Aho-Corasick algorithm,predecessors have done a lot of improvement work.The main improvement work is focused on the study of more efficient storage structure to save the transfer information of Aho-Corasick automaton.Classic efficient storage formats include Banded-Row format vectors,Sparse-Bands format vectors and bitmap.After in-depth analysis of predecessors' improvement,an improved space-efficient AhoCorasick algorithm based on classification storage is proposed.The storage format of state nodes in the new algorithm is not limited to a single format.It can be flexibly chosen from transfer table,Banded Row vector format,Sparse Bands vector format,bitmap and other format.In the preprocessing phase,calculate the transition information,failure information and output information of each state node,then compare with predefined rules,choose the most appropriate storage format to store state node in order to minimize the cost of storage space.In the search stage,according to the state node identifier,identify storage type of the state node,use the appropriate way to read the transfer information,failure information and output information to complete the search.Experiments show that compared with classic Aho-Corasick and Bitmapped Aho-Corasick algorithms,the new algorithm vastly compresses the storage space of automaton at the expense of slight time performance in the search stage.Therefore,the new algorithm based on classification storage is suitable for scenes with expensive space resources and large pattern sets.
Keywords/Search Tags:Multi-pattern matching, Aho-Corasick algorithm, Bitmapped Aho-Corasick algorithm, Space-efficient
PDF Full Text Request
Related items