Font Size: a A A

A Multiple Chinese-Patterns Matching Algorithm And Its Parallelization

Posted on:2016-08-16Degree:MasterType:Thesis
Country:ChinaCandidate:S F LiFull Text:PDF
GTID:2348330488973328Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Pattern(String) matching is an important research direction in the field of computer science. It is widely studied and used in both academia and industry. Pattern matching algorithms are widely used in various fields of text processing, such as the network security, information retrieval, text filtering, and bioinformatics. With the rapid development of computer network, mass text retrieval and computational bioinformatics, the number of pattern in the pattern matching problem increases quickly, the challenge of pattern matching algorithm is also growing. Especially in the field of network security in China, the current pattern matching algorithms are difficult to meet the performance requirements of relative large-scale pattern matching system in Chinese character environment. Therefore, it is an urgent need for pattern matching algorithms for relative large-scale pattern matching problems.This dissertation first briefly introduces representative classical pattern matching algorithms such as the BF algorithm, KMP algorithm, BM algorithm, AC algorithm, AC_BM algorithm and WM algorithm. Then, for the environment of Chinese character set, an improved WM algorithm called MSCZ algorithm is proposed,in which the main improvement include: 1. Reduce the high memory space occupation and high conflict of WM algorithm in Chinese character pattern set with two level hash strategy. 2. A strategy of compressing the pattern string is proposed to solve the problem of too long pattern chains in HASH table for the prefix contain situation in Chinese pattern set. 3.Construct the further compressed string chain in HASH table as a binary search tree, which can improve the time complexity of the original WM algorithm of linear time complexity to logarithmic time complexity, and can improve the performance of the algorithm when the pattern set relative large. The experimental results show that MSCZ algorithm uses less time compared with WM algorithm.With the increasing size of the text, the total time of pattern matching for the original serial multi-pattern matching is also increasing. So the serial pattern matching algorithm can't be applicable to massive text. A parallel implementation scheme of MSCZ algorithm based on Hadoop Mapreduce is proposed in this paper, aimed at improving the speed for MSCZ algorithm processing massive data sets. When the input data of MSCZ algorithm contains large number of small files in the Hadoop Mapreduce parallelization process, the system will start up many jobs and result in low system performance. This paper proposes an improved scheme: define the file input format which combines multiple small files as one input, and this will reduce the number of jobs. The experimental results show that the performance of parallel MSCZ algorithm is more efficient than the serial MSCZ algorithm.
Keywords/Search Tags:multi-pattern matching, high performance, large scale, Hadoop, Mapreduce
PDF Full Text Request
Related items