Font Size: a A A

String Matching Algorithm And Application Of Network Content Analysis

Posted on:2004-07-30Degree:DoctorType:Dissertation
Country:ChinaCandidate:J L TanFull Text:PDF
GTID:1118360185496973Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
It's a real challenge for One-Pass, Space-Efficient Algorithms for Massive Data Streams. Massive data sets are increasingly important in a wide range of applications, including Internet and computational biology such as DNA sequences. Searching a stream is widely used in network information monitoring, intrusion detection system and DNA sequences Matching. In network operations, raw data typically arrive in "streams". Decisions must be made by algorithms that make one pass over each stream, throw much of the raw data away, and produce small data structures for further processing. The algorithms can not prepare the text and can handle a very large number of patterns. This thesis aims to string matching algorithms with the background of network content analysis.At first, we summarize string matching algorithms and data stream work of recent years. We discuss some rules to design the new algorithm for massive data streams.After summarization, we give a new string matching algorithm called Int-Match. Int-Match is a fast multiple string matching algorithm using number operations. The approach can scan the text directly as 16-bits short integer array, not 8-bytes characters. It changes the string matching to a unsigned integer numerical computer. To test on the text set, Int-Match is 200% better than the Wu-Manber algorithm. From the practical standpoint, Int-Match is simple and can be extended to other problems. From the theory standpoint, Int-Match is an algorithm with different ideas for useing many hardware advantages.After giving the two new string matching algorithms, we present a combination keyword matching algorithm called ExprAuto. ExprAuto is an automation algorithm that can handle the query expression which combinate the keywords with"and"or"or". For monitoring thousands of network data flows, we adapt the predicate counting algorithm to the automation algorithm. Although ExprAuto uses much more spaces than predicate counting algorithm, it is faster than the predicate counting algorithm. ExproAuto can use O(1) to handle data stream item but the predicate counting algorithm must use O(n) (n is expression size).We have some faster string matching algorithms and some expression matching algorithms for massive data stream .We can use them for network content analysis.We proposed a new technique to speed text categorizing. We use N-gram Chinese...
Keywords/Search Tags:Data stream management system, string match, network monitoring, Pattern matching, Network security, Internet surveillance, Online Text Categorization, Chinese Information Processing, Vector Space Model, N-gram
PDF Full Text Request
Related items