Font Size: a A A

Multi-pattern Matching Algorithms With Wildcards

Posted on:2018-12-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:AHMED ABDO FARHAN SAIFFull Text:PDF
GTID:1318330515976118Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
The amount of information that we process every day is increasing at a huge rate.To preserve such amount of information with strong Confidentiality,Integrity and Availability for all authorized users needs strong technologies and robust tools.The information security and management tools directly depend on their algorithms' efficiency and proficiency.Multi-pattern matching algorithms are essentially used in many information processing and information security applications.The main applications that are essentially based on multi pattern matching algorithms are Intrusion Detection/Prevention Systems,antivirus systems,data mining applications,data analyzing applications;such applications need strong and efficient algorithms that look for thousands of patterns simultaneously and accurately,and then report the positions of every pattern occurrence.The direct application of single-pattern matching algorithms looking for thousands of patterns one by one,in very long string,is not an efficient operation,especially when we deal with huge amounts of information.The needs to develop efficient multi-pattern matching algorithms that run in time without any delay for information transformation are essential.Otherwise,the security or processing tools will become the bottleneck of whole application performance.Multi-Pattern matching is defined as,given a set of k patterns P = {p0,p1,p5...pk} with total length ?P?,every pattern has its length m.And text t,with length n,those string chains are over the same finite alphabet ?,where,n,?P?>0 and n?m,if pattern p occurs in text t,they determine the occurrence positions and occurrence times.There are many techniques used to improve the performance of the pattern matching algorithms.General linear product,bit parallelism,packed string,Hamming distance,prime number encoding,regular expression,hash function and AC automaton are counted as the main techniques that are used to improve the performance of the multi-pattern matching algorithms.In case of pattern matching algorithms with wildcards,it can be regarded as a special case of general linear product,which can be treated as Boolean convolution or polynomial convolution.Polynomial convolution can be efficiently calculated by Discrete Fourier transform(DFT)in four main steps.Those steps involve the pre-processing step,computing DFT of the two polynomial coefficients,doing simple component-wise multiplication between the DFT results,and computing inverse DFT(Inverse Discrete Fourier transform(IDFT))of the multiplication products,as shown below.Let a and b being two vectors both of them with length n,we need to use the Fast Fourier Transform FFT(DFT and its inverse(IDFT)to compute the convolution between those two vectors,therefore,the following steps are required:1.Pad to the end of a and b each with 0 to define a' and b' asa' =[a0,a1,…,an-1,0,0,...,0]Tb' =[b0,b1,...,bn-1,0,0,...,0]T2.Compute the discrete Fourier transform for a' and b',y = Fa' and z=Fb'.3.Multiply the vectors y and z by component-wise defining the simple producty.z = Fa'.Fb'4.Compute the inverse discrete Fourier transform of this simple product.c = F-1(Fa'.Fb')The result c is a convolution result between a and b vectors and can be calculated in O(n log n)time complexity.These four steps above are called convolution theorem,and can be used to calculate the convolution between pattern and text in same time complexity i.e.O(n log m).Where,n is the number of polynomial coefficients of longer vector or length of text t,and m is the number of polynomial coefficients of smaller vector or length of pattern m.In some cases,the patterns are long,where Fast Fourier Transform(FFT)cannot be used to calculate the convolution in single machine operation.Multiplying big integers count as a good example to show how the long pattern can be fitted into small parts that,the process can be done in single machine operation.A hash function is a computational efficient function that maps binary strings of arbitrary length to binary strings of some fixed length,called hash values,where a hash value is used as a compact representative of an input string.A hash function computationally cannot find two distinct inputs,that hash to the same value.The hash function has mainly three schemes based on the used operations,first two schemes are hashing by division and hashing by multiplication,the third scheme is a universal hashing.In pattern matching,the hash value is used to reduce the computational expenses,where the pattern and every alignment of text are hashed to small values that are used to check the matches simply.Bit-parallel algorithms are based on taking the advantage of the fact that a single computer instruction can manipulate bit-vectors with w bits.If data-items of an algorithm can be packed into w bits,it will achieve a gain in time and space by processing many data within a single machine instruction.The following notations are used to describe the main and basic bitwise operations that are used in many researches,'?' denotes bitwise "OR",'&' denotes bitwise "AND",'!' denotes bitwise "NOT",'>>' denotes shift the bit-vector to the right and'<<' denotes shift the bit-vector to the left,where,in shifting,process zero padding is used in both directions.The packed string matching technique is based on the fact that the machine word size w is larger than alphabet character log2a.A single machine word can accommodate up to ? = w/log2? characters.Packed string matching guarantee that,many data can be packed to a single machine word and then can be manipulated by a single machine instruction.The main machine instructions that are used in pattern matching algorithms as bit-wise operations instructions include:logical instructions,arithmetic instructions and control instructions.There are especial instructions for string matching,wsmatch(a,b)(word-size matching instruction)and wscmp(a,b)(word-size compare instruction)is considered important specialized instructions for string matching.Specialized word-size packed string matching instructions are based on the Intel streaming SIMD extensions(SSE)technology.Recently,prime number encoding with Chinese Remainder Theorem CRT are proposed to improve the performance of pattern matching with wildcards.Many issues should be considered in multi-pattern matching algorithms developing.The main issues are,patterns group size,searched text size,alphabet size,individual pattern sizes,pattern character's case sensitivity,algorithmic attacks resistance,search running time,non-interrupted patterns updating,stringent worst case performance and wildcards existence.For instance,in intrusion detection systems,one of the important problems in developing efficient multi-pattern matching algorithm is the updating signature without stopping the intrusion detection systems,where such problems are caused by the algorithms that are based on structured data such as AC automaton.Recently there are many interesting-to-develop pattern matching algorithms,in standard form,as well as,in nonstandard form.Nonstandard form means that the pattern or text or both of them contain wildcards,and wildcards symbols match any character in the alphabet including itself.In this dissertation,we have studied multi-pattern matching problems with wildcards.First,we introduced four multi-pattern matching algorithms with wildcards as the main algorithms of previous works.The four algorithms are with four different mechanisms.The four algorithms rely on Fast Fourier Transform(FFT)to calculate the convolution between patterns and text.The first three algorithms are used to compare our three algorithms theoretically;the fourth algorithm is used to compare our algorithm practically.The first algorithm is based on calculating Euclidean Distance between patterns and text,and runs in O(n log ?P?+ooc log k)time,where ooc is the total number of occurrences of patterns P in text t.The second algorithm is based on computing a Hamming distance between patterns and text,and runs in O(n log ?P? log ? + occ log k)time,and ? is the alphabet size.The third algorithm is based on prime number encoding and runs in O(n log m + ooc log k)time,where m is the length of the longest pattern.The fourth algorithm is a repetition of the single pattern matching algorithm of Clifford and Clifford algorithm.Where Clifford and Clifford algorithm is among the most efficient algorithms in single-pattern matching with wildcards.The algorithm runs in O(k n log m)time.Secondly,we presented three multi-pattern matching algorithms with wildcards.The first algorithm is based on Euclidean-distance and Hash-function.The algorithm includes three main steps:pre-processing step,checking the first block of every pattern step and the step of checking the remaining part of the patterns when the first block is matched.In the pre-processing step,each pattern is partitioned into suitable length l blocks,assuming that the length of the shortest pattern is sufficient to create one block.If the last part of the pattern is not sufficient to create a complete block,the last block is overlapped with the previous block.The modified sum of squared differences between the first block of pattern py and the remaining blocks of same pattern are individually calculated.First,every symbol in each pattern and text are encoded with a unique integer number and the wildcards with Os.Then,the modified sum of squared differences value Rp[by,1,by,q]between the first block by,1 and by,q block is calculated as follows:(?)(?)(?)The Rp[by,1,by,q]values are used as hash values,H[by,1,by,q],to check the match of the remaining blocks.To check the match of the first blocks of each pattern,the modified sum of squared differences between each first block by,1 of the patterns and the text t for every possible alignment are calculated as follows:(?)(?)(?)The sum equals 0 at position i,if the block by,1 matches the text at position i.If the block by,1 of the pattern py matches the text,the pattern py counts as partially matched,and its remaining blocks need to be checked.To check the match of the remaining blocks of the partially matched pattern py,assume the block by,q is a remaining block of the partially matched py and the first block by,1 matches the text at position i;then,the match of block by,q needs to be checked at the position i +(q-1)l.The block by,q matches the text at position i +(q-1)l without a wildcards if H[by,1,by,q]=H[by,1,ti+(q-1)l].Where H[by,1,by,q]is the hash value between by,1 and by,q,and H[by,1,ti+(q-1)l]is the hash value between by,i and the text at position i +(q-1)l.If there are wildcard symbols in the matched remaining blocks of the patterns or in the text,there are lost values caused by the zero wildcards values.The lost values can be easily calculated using component-wise multiplication because the positions of the wildcards can be saved during the encoding step.Let Rt'(j)[by.1,by,q]and R'p(j)[by,q,ti+(q-1)l]are the sums of the lost values caused by the zero values of the wildcard symbols in the text and remaining block of pattern by,q,respectively.Then,the block by,q matches the text with a wildcards at position i +(q-1)l ifRp[by,1,by,q]+ R' P[by,1,ti?ql]=Rt(i + ql)[by,1,ti?ql]+ R't(i+ ql)[by,1,by,q].If the first block and all the remaining blocks of the pattern py match the text,the pattern py has a match.The algorithm has O(?P?-lk)pre-processing time,and efficiently finds the pattern matches in O(k n log l + o + d)time.Where n is the length of the text t,l is the length of the blocks,k is the number of patterns,o is the number of blocks that match using the hash values and d is the number of wildcard symbols in the blocks that match using the hash values in the patterns and text.The major advantages of our algorithm are that(a)it can find the matches of long patterns efficiently,(b)if the alphabet size a is large,the algorithm is still efficient and(c)it supports non-interrupted pattern updates.The second algorithm is based on bit-parallelism.The algorithm comprises two main processes in addition to the pre-processing step.The first process constructs the bit-vector of text symbols,which is called "updating the bit-vectors of text characters array tbv[i + ml]",The second process computes the Hamming distance,which is called "updating bit-vectors of result array R[i][y]." To guarantee that only one character is processed each time of the shifting process,a sliding window of length w moves along the text.Updating the bit-vectors of the result array R[i][y]occurs on the left side of the sliding window at position i of text,whereas updating the bit-vectors of the text characters occurs on the right side at position i +w.In a pre-processing step,the bit-vectors of pattern characters pbv[][]and the first occurrence positions of the pattern characters(fp[][])are constructed.Here,ind[][]is an array that is used to index the characters in which patterns occur.In updating bit-vectors of text characters,each character in the alphabet has two arguments,i.e.,the text character bit-vector tbv[]and the last occurrence position of the character lp[].The last occurrence position of the character is an integer number that indicates the last updated position of the text character bit-vector in the array tbv[].Text character bit-vector updating occurs at the right end of the sliding window at position i + w.When i = 0,the bit-vector of the character t[w]at position w should be updated;therefore,the bit-vectors of the first w text characters are constructed.In each shifting process of sliding window,the sliding window is shifted by 1,character x enters the sliding window,and the bit-vector of character x tbv[t[i + ml]]is updated.Character updating process can be achieved as follows.The first current bit-vector of character x tbv'[t[i + ml]]is shifted to the left side according to the difference between the current position and the last update position of character x(i + ml-lp[t[i + ml]]).tbv'[t[i + rml]]= tbv[t[i + ml]]<<(i + ml-lp[t[i + ml]])The bit of the character x in the window is added by OR-ing the shifted current bit-vector with unsigned integer 1.tbv'[t[i + ml]]= thv'[t[i + ml]]? 1Then,the bit-vector and its occurrence position are stored in tbv[t[z + ml]]and lp[t[i +ml]],respectively.In updating result array bit-vectors,assume that the sliding window is shifted by 1.In the left side of the sliding window,there is one character at position i that has left the sliding window.Assume the character that has left the sliding window is a character x.Then,the text bit-vector of character x is used to update the bit-vectors at position i of the array R[i][y]for all the patterns that contain the character x,which can be achieved as follows.First,the current text bit-vector of the character x is shifted to the left side according to the difference between the current position and the last update position(i-lp[t[i]]).tbv[t[i]]-tbv[t[i]]<<(i-lp[t[i]])Assume that a character x occurs in pattern py and the first occurrence of character x is at position fp[y][x]of the text.Then,the text bit-vector of character x is shifted to the right side by fp[y][x].tbv[t[i]]= tbv[t[i]??]fp[y][x]Then,an AND(i.e.,"&")bitwise operation is performed between the text bit-vector tbv[t[i]]and the pattern bit-vectorpbt[y][x].Temp= tbv[t[i]]&pbt[y][x]The bit-vector result is added to the result array by OR-ing it with the bit-vector at position(i-fp[y][x],j)of the array R[i-fp[y][x]][y].R[i-fp[y][x]][y]=R[i-fp[y][x]][Y]?TempThen,the pattern p' matches the text at position i if the bit-vector R[i-fp[y][x]][y]and the complement of the wildcards bit-vector of pattern py ! pbv[y][*]are identical.The algorithm efficiently finds the patterns matches in O(n + d(n/?)(m/w))time.The algorithm overcomes the problem of the high percentage of wildcards occurrence.The third algorithm is based on packed string and bit-parallelism.The main idea is to pack every ? = w/log2 ? symbols of text to a single machine word.The single matching word is used as a sliding window that slides along the whole text.In every shifting step,the packed string of the sliding window is compared with the packed string of the stored patterns.The algorithm includes two main process;the sliding widow updating process and comparing process,in addition to pre-processing process.Let us assume that the size of machine word w is a standard size(i.e.32 or 64 bits).Then,every machine word can contain up to ? = w/log 2 ? characters,where ? is the alphabet size,?= log2 ? is the number of bits that are used to encode a single character from the alphabet.In the pre-processing step,every symbol in the text and patterns is encoded to a unique integer number and wildcards to O's.All the patterns characters are packed to a machine word.Let us assume that every pattern can be encoded to a single machine word(i.e.m<?),therefore,every comparison process will take only O(1)time.The packing process is executed by shift and OR bitwise-operations.The string result is stored as packed string patterns.In the same way,we pack the text characters.First,a sliding window with ? = w/log2 ?size of characters slides over text t.The first a characters are packed,and then the packed string of the sliding window is compared with the packed string of patterns.The comparing process can be achieved as follows.First,the complement of the packed string of pattern py is computed,then AND bit-wise operation is executed at position i between the packed string of pattern py and the sliding window packed string,where the bits of wildcards word is all 0's.If the result bit-word is all 0's,that means there are a match of pattern py at position i of text.However,the matching maybe a false match.Other process are used for double check,the bits of wildcards change to all l's and OR process is executed between the complement of the packed string of pattern and the packed string of sliding window.If the bits of result word is all l's,then the pattern py has a match.Notice,in AND bit-wise process,the wildcards bits are all 0's,while in OR bit-wise process are all l's.In case the patterns cannot be packed into a single machine word(i.e.m>?),additional process are needed to check the matching of the remaining part of the partial matched pattern.
Keywords/Search Tags:Pattern matching, FFT, Hash function, Euclidean distance, bit-parallelism
PDF Full Text Request
Related items