Font Size: a A A

Multi-pattern Matching Algorithms

Posted on:2013-12-02Degree:MasterType:Thesis
Country:ChinaCandidate:AHMED ABDO FARHAN SAIFFull Text:PDF
GTID:2248330371983050Subject:Network and information security
Abstract/Summary:PDF Full Text Request
The amount of information that we are dealing with today is being generated at an everincreasing rate. To keep like this amount of information safely and with strong Confidentialityand Integrity and Availability for authorized users need a strong technologies and tools. Theinformation security and management tools depend directly to its algorithms efficiency.Multi-pattern matching algorithms are the core for many information securityapplications such as, antivirus systems and intrusion detection systems which look forthousands of patterns in same time and report the positions of them.The direct apply of single pattern matching algorithm to search one by one for thousandsof patterns is not efficiently to deal with huge amount of information specially, in bignetworks. Then, the needs to develop fast multi-pattern matching algorithms to work in timewithout delay for information transform are essential. Otherwise, the security tools will be thebottleneck of whole network speed.Pattern matching can be defined as: given a pattern p with length m, and text t, withlength n, over the same finite alphabet∑, where, n, m>0and n≥m, if pattern p occurs in textt, then determine the occurrence positions and occurrence times.Pattern matching problems can be regarded as special case of general linear product,which can be treated as polynomial convolution or Boolean convolution. Polynomialconvolution can be calculated by Discrete Fourier transforms (DFT) in four steps involvespreprocessing step, computing DFT of the two polynomial coefficients, doing simplecomponent-wise multiplication between them, and computing inverse DFT(Inverse DiscreteFourier transform (IDFT)) of the multiplication products, as shown bellow.If a and b is two vectors both of them with length n, and we need to use the DFT and itsinverse IDFT to compute the convolution between them, the following steps is required:1. Pad a and b each with0to define a′and b′asa′=[a0, a1,…, an-1,0,0,…,0]Tb′=[b0, b1,…, bn-1,0,0,…,0]T2. Compute the discrete Fourier transform for a′and b′, y=Fa′and z=Fb′.3. Multiply the vectors y and z component-wise defining the simple product 4. Compute the inverse discrete Fourier transform of this simple product. That iscompute c=F-1(Fa’.Fb’)These four steps called convolution theorem, and can be used to calculate theconvolution between the pattern and text.The time waster step is computing DFT and inverse DFT, we can use Fast FourierTransform (FFT) to calculate this step to get O(n log n) time, where, n is the number ofPolynomial coefficients(length of the pattern and text). The O(n log n) is the time needed tomultiply two polynomials, this means pattern matching can also be solved in O(n log n) time.Sometimes, the length of pattern is so long, where, the using of Fast Fourier Transform(FFT) to calculate the convolution cannot be done in single machine operation. Multiplyingbig integers is a good example to show how the long pattern can be fitted to small parts where,the process can be done in single machine operation.A hash function is a computationally efficient function mapping binary strings ofarbitrary length to binary strings of some fixed length, called hash values, which a hash valueserves as a compact representative of an input string. A hash function is typically chosen suchthat it is computationally infeasible to find two distinct inputs which hash to a same value.Hash function mainly has three schemes for their creation, two of the schemes, hashing bydivision and hashing by multiplication, the third scheme is universal hashing,There are many consideration issues to develop multi-pattern matching algorithms, themain issues are:1. Patterns group size.2. Searched text size.3. Alphabet size.4. Individual Patterns sizes.5. Pattern character case sensitivity.6. Algorithmic attacks.7. Search running time.8. Non-interrupted patterns updating.9. Stringent worst case performance.10. Wildcard existence.In intrusion detection systems one of important problems in developing fast multi-patternmatching algorithm is the updating signature without stopping the intrusion detection systems, like algorithms depend on automaton.Recently there are many interesting to develop pattern matching algorithms in standard,as well as, with wildcard which, the pattern or text or both of them contain wildcard symbol,where, wildcard symbol match any character in the alphabet.The single pattern Karp-Rabin algorithm is based on computing a fingerprint of thepattern by hash function. The first step is to encode every character by unique prime numberfor both of pattern and text, then compute the hash function of a searched pattern. From thebeginning of the text, take the size of a slid window equal to the size of searched pattern, withsame hash function compute the fingerprint of slid window and compare it with searchedpattern fingerprint, if equal, first check is that pattern really occurred, then report the positionof slid window.In this dissertation, we have studied multi-pattern matching problem in both standard andwith wildcard. In standard form, multi-pattern matching based on Hash function is developed,which extend Karp-Rabin algorithm to run with preprocessing time O(|P|), if there is nomatching (real and false), and all the patterns have same lengths the algorithm will run inO(n+nlogk) time, if there are matching for a pattern have length l the verification functionwill need l time to verify, the overall algorithm times O(n+nlogk+ooc(l)) or O(mn+nk+ooc(l))time if the patterns have m different lengths (worst case), where occ is the total number ofreally and false matching of pattern p in text t. The numbers computed by the algorithm aslarge as is xm(m+1) where σ≈x/lnx. The algorithm is not easy to attack and can work well onq processer in parallel form.There are many multi-pattern matching algorithms with wildcard are presented, here we firstpresent four algorithms with four technique, as a previous work, three of that algorithms useFast Fourier Transform (FFT) technique to compute the convolution between patterns and text,where Fast Fourier Transform (FFT) consider the fastest method to compute the convolutionbetween two vectors, the first algorithm based on prime number encoding and run in O(n logm+ooc log k) time, where m is the length of the longest pattern. The second algorithm basedon computing a Hamming distance between patterns and text, and run in O(n log|P|log σ+occ log k) time, and σ is the alphabet size. The third algorithm based on Euclidean Distanceand run in O(n log|P|+occ log k) time, where ooc is the is the total number of occurrences ofpatterns P in text t. The fourth algorithm does not use Fast Fourier Transform (FFT), it basedon AC automaton and run in O((|t|+∥P∥)log K/log log K) time, where K is the number of keywords in the patterns.Secondly, we present a multi-pattern matching algorithm with wildcard based onEuclidean-distance&Hash-function, the algorithm first divides the patterns to small l lengthblocks and also divide the text to2l length overlapped blocks, where, every block has anoverlap of length l with the previous one. The algorithm check the first block of every patternby calculating the Euclidean distance using Fast Fourier Transform (FFT) like in Clifford andClifford algorithm. According to Kalai algorithm, Euclidean distance can be used as hashvalue; these values can be used to check the remaining blocks of the pattern where its firstblock has matching. If we have k patterns, the time to check the first blocks of k patterns usingFast Fourier Transform (FFT) is O(k n log l) time. If there are o patterns have matching andevery pattern has x blocks, the time need to check the remaining blocks is O(ox) time. If thetext has wildcard symbol the hash value will change, so, another additional computationneeded to calculate that change.Thus, if there is no matching, the algorithm runs in O(k n log l) time, and O(k n log l+ox)time if there are a matching with wildcard and the wildcard symbol only in patterns, if the textalso contain wildcard the algorithm runs in O(k n log l+2ox) time. For one matched pattern,the false matching probability is1/σx-1, where k is the number of patterns, l is the block length,o is the number of matched patterns, and x is the number of blocks in a matched pattern.
Keywords/Search Tags:Multi-pattern matching, FFT, Hash function, Euclidean distance
PDF Full Text Request
Related items