Font Size: a A A

Study On Automaton-Based Regular Expression Matching Algorithms

Posted on:2012-08-26Degree:MasterType:Thesis
Country:ChinaCandidate:Y N LiFull Text:PDF
GTID:2248330395958139Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of computer science, the amount of information data increases exponentially, which brings great challenges to the work of data processing, and the queries from the users becomes more complicated. Due to the scale of the data needed to process is getting larger and larger, searching on it becomes more difficult. As a power tool for defining complicate queries, the regular expression provides a flexible and efficient method for the problem of processing the search on text. Nowadays, the application of regular expression has involved in many fields, which makes it convenient to process the queries. How to answer the regular expression queries rapidly and efficiently is of vital importance.At present, a lot of methods have been proposed to solve the problem of regular expression matching. These methods mostly search the regular expression online, which means the data file has been processed beforehand. According to the theory of matching, they can be divided into three types:regular expression matching based on NFA, regular expression matching based on DFA, regular expression matching based on filtering strategy. The last method which has the better query performance is commonly used now. However, the existing methods based on filtering strategy only effective for regular expressions which have some special structure, and the structure of the regular expression is very complicated, how to select a better filter method to answer the queries and improve the search performance is a challengable work.According to the characteristics of the regular expressions, this thesis proposes online query processing and offline query processing technology for regular expressions. When the data file is not processed beforehand, the algorithm extracts the set of best factors which are most representative from the regular expressions, and then choose the BM algorithm or CW algorithm based on the number of best factors to search the candidate strings which include the best factors as their substrings, finally verify the candidates on DFA to gain the results. When the data file is processed beforehand, the thesis proposes three types of index, basic suffix tree index, optimal suffix tree index and cluster index. The method based on suffix tree index searches the candidate strings with the best factors on suffix tree and verifies the candidates. The method based on cluster index uses the cluster algorithm to divide the strings into cluster set and then extract the common strings from each cluster. The strings in each cluster will be filtered if at least one of common strings cannot be recognized by the DFA. Finally, the experimental tests based on real and simulated datasets show that the proposed online and offline query processing can answer the regular expression matching efficiently.
Keywords/Search Tags:regular expression, automaton, suffix tree index, cluster index, pre-filter, k-means clustering
PDF Full Text Request
Related items