Font Size: a A A

Multi-pattern string matching algorithms

Posted on:2011-03-11Degree:Ph.DType:Dissertation
University:University of FloridaCandidate:Zha, XinyanFull Text:PDF
GTID:1448390002467458Subject:Engineering
Abstract/Summary:
Multi-pattern string matching is widely used in applications such as network intrusion detection, digital forensics and full text search. In this dissertation, we focus on space efficient multi-pattern string matching as well as on time efficient multicore algorithms.;We develop a highly compressed Aho-Corasick automata for efficient intrusion detection. Our method uses bitmaps with multiple levels of summaries as well as aggressive path compaction. Our compressed automata takes 24% to 31% less memory than taken by the compressed automata of Tuck et al. [57] and the number of additions required to compute popcounts is reduced by about 90%.;We propose a technique to perform high performance exact multi-pattern string matching on the IBM Cell/Broadband Engine(Cell) architecture, which has 9 cores per chip, one control unit and eight computation units. Our technique guarantees the remarkable compression factors of 1 : 34 and 1 : 58, respectively on the memory representation of English language dictionaries and random binary string dictionaries. Our memory-based implementation delivers a sustained throughput between 0.90 and 2.35 Gbps per cell blade, while supporting dictionary sizes up to 9, 260, 000 average patterns per Gbyte of main memory.;We focus on Scalpel, a popular open source file recovery tool which performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved. Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read disk faster.;Furthermore, we develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. For the GPU-to-GPU case, we consider several refinements to a base GPU implementation and measure the performance gain from each refinement. For the host-to-host case, we analyze two strategies to communicate between the host and the GPU and show that one is optimal with respect to run time while the other requires less device memory. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores running off of a Xeon 2.8GHz quad-core host CPU show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-thread CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation. For the host-to-host case, the GPU AC code achieves a speedup of 3.1 relative to a single-threaded CPU implementation. However, the GPU is unable to deliver any speedup relative to the best multithreaded code running on the quad-core host. In fact, the measured speedups for the latter case ranged between 0.74 and 0.83. Early versions of our multipattern Boyer-Moore adaptations ran 7% to 10% slower than corresponding versions of the AC adaptations and we did not refine the multipattern Boyer-Moore codes further.
Keywords/Search Tags:String matching, Multipattern boyer-moore, GPU, Algorithms, File carving
Related items