A Multiple Chinese-Patterns Matching Algorithm And Its Parallelization

Posted on:2016-08-16

Degree:Master

Type:Thesis

Country:China

Candidate:S F Li

Full Text:PDF

GTID:2348330488973328

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Pattern(String) matching is an important research direction in the field of computer science. It is widely studied and used in both academia and industry. Pattern matching algorithms are widely used in various fields of text processing, such as the network security, information retrieval, text filtering, and bioinformatics. With the rapid development of computer network, mass text retrieval and computational bioinformatics, the number of pattern in the pattern matching problem increases quickly, the challenge of pattern matching algorithm is also growing. Especially in the field of network security in China, the current pattern matching algorithms are difficult to meet the performance requirements of relative large-scale pattern matching system in Chinese character environment. Therefore, it is an urgent need for pattern matching algorithms for relative large-scale pattern matching problems.This dissertation first briefly introduces representative classical pattern matching algorithms such as the BF algorithm, KMP algorithm, BM algorithm, AC algorithm, AC_BM algorithm and WM algorithm. Then, for the environment of Chinese character set, an improved WM algorithm called MSCZ algorithm is proposed,in which the main improvement include: 1. Reduce the high memory space occupation and high conflict of WM algorithm in Chinese character pattern set with two level hash strategy. 2. A strategy of compressing the pattern string is proposed to solve the problem of too long pattern chains in HASH table for the prefix contain situation in Chinese pattern set. 3.Construct the further compressed string chain in HASH table as a binary search tree, which can improve the time complexity of the original WM algorithm of linear time complexity to logarithmic time complexity, and can improve the performance of the algorithm when the pattern set relative large. The experimental results show that MSCZ algorithm uses less time compared with WM algorithm.With the increasing size of the text, the total time of pattern matching for the original serial multi-pattern matching is also increasing. So the serial pattern matching algorithm can't be applicable to massive text. A parallel implementation scheme of MSCZ algorithm based on Hadoop Mapreduce is proposed in this paper, aimed at improving the speed for MSCZ algorithm processing massive data sets. When the input data of MSCZ algorithm contains large number of small files in the Hadoop Mapreduce parallelization process, the system will start up many jobs and result in low system performance. This paper proposes an improved scheme: define the file input format which combines multiple small files as one input, and this will reduce the number of jobs. The experimental results show that the performance of parallel MSCZ algorithm is more efficient than the serial MSCZ algorithm.

Keywords/Search Tags:

multi-pattern matching, high performance, large scale, Hadoop, Mapreduce

PDF Full Text Request

Related items

1	Research Of High Performance Multiple Pattern Matching Algorithm For Large-scale Pattern Set
2	Performance Optimization Of The Multi-Pattern Matching Algorithm For The Large-scale URL Keywords
3	Research On Large-scale Multi-pattern Matching Algorithms
4	Large-scale and high-throughput pattern matching on parallel architectures
5	Hadoop Based Parallel Genetic Algorithm For Large Scale VRPTW
6	Research Of High-Performance String Matching Technology For Large Scale String Set
7	Research On Distributed SVM Algorithm Based On Hadoop Platform
8	Study On Deploying And Optimizing MapReduce Framework On High Performance Computer
9	The Optimization Of High Performance MapReduce FairScheduler And The Implementation On Simulator Of Huge Scale Cluster
10	Large-scale Ontology Matching On MapReduce