Font Size: a A A

Pattern Matching With Variable Wildcards

Posted on:2015-03-25Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y L LiuFull Text:PDF
GTID:1268330428974529Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The information on the Internet shows explosive growth from the rapid development of the Internet, and large amounts of data and information are produced in many application fields. How to extract useful information quickly and effectively from the large amounts of information has become an important research topic. Efficient pattern matching algorithms are expected to quickly obtain user-specified information from massive information. Pattern matching is a classical problem, and it becomes more complicated if there may be flexible wildcards between adjacent characters in a given pattern P. Pattern matching with wildcards and length constraints (PMWL) has not only theoretical research value but also realistic applications in gene comparison, information retrieval and network security.This thesis analyzes the pattern matching problem with variable length wildcards. We study three research issues:(1) the key character locations in a given text;(2) a text segmentation algorithm, which combined with the suffix tree, improves the retrieval efficiency of pattern matching in both time and space; and (3) the completeness of the PMWL problem with a novel data structure of CluTree. A retrieval algorithm based on the CluTree can further improve the matching completeness and time efficiency.The main contributions of the thesis are as follows.(1) A study on the key character locations in a given text;In PMWL, a wildcard between adjacent characters makes the matching problem very complex. When the length of text T is relatively large and the distribution of characters is uneven, existing algorithms have large positioning errors. As existing algorithms do not consider the distribution of the matching text T, and do not take into account the different roles of different characters in P, the different characters have different functions of various sub-patterns. The retrieval of a key character may reduce the efficiency of the whole matching process. This thesis proposes an evaluation mechanism for character importance in P, according to the appearance difference of each character in P and the distribution of P in T. At first, we calculate the frequency function of P and a self-frequency function. According to these function results, we can find the function of P, and the key character pk. The matching algorithm based on key character search seeks to find the key character as the location character, then make use of the matching algorithm SAIL and improve the scanning form. When the form is constructed successfully, the algorithm gets a matching solution. The algorithm returns all matching results after the scanning of text T is finished.(2) A text segmentation algorithm, which combined with the suffix tree, improves the retrieval efficiency of pattern matching in both time and space;To divide and conquer pattern matching in a massive text is an effective strategy to resolve the performance bottleneck problem of massive string matching. Our algorithm divides text T into multiple windows without destroying a matching substring, and searches in the most possible area of each window. Our multi-window mechanism and retrieval optimization method greatly decrease the retrieval time overload. The segmentation of a suffix tree can improve the space efficiency of suffix tree algorithms. The design used in the field of bio informatics, for identification of DNA sequences and protein sequence matching, can improve the efficiency of pattern matching. How to correctly segment is the core problem. In this thesis, our design algorithm can find a segmentation character from T, which does not affect the matching results. Based on the forward process in SAIL for determining the solution space using local constraints, we can eliminate all the matching areas in which constraints are not satisfied. In the process of realization, our algorithm scans each initial segmentation point, and this point may not match with the next point, then ends this cycle and returns to the current scanning position as the correct segmentation position. The segmentation significantly improves the efficiency of retrieval on massive texts.(3) A study on the completeness of the PMWL problem with a novel data structure of CluTree.Because of the emergence of wildcards, the search space in the PMWL problem is more complicated than traditional pattern matching (TPM) which has no wildcards. There is no disputing in TPM that the matching position between the text and the pattern is uniquely determined for each occurrence. However in the PMWL problem, a candidate matching position in text T may appear as a sub-pattern position or as a wildcard position. The diversity of the position to be matched leads to matching uncertainty. There may be multiple candidate substring matching positions nested within each other. How to choose each correct matching position in order to obtain the optimal number of matches is a difficult problem to be solved. Existing algorithms have adopted the left-most strategy in dealing with the entanglement pattern matching problem in a nested region. The strategy is too simple to solve such a problem:when the pattern is more complex and the matching areas in T are nested, some matching substrings will inevitably be lost. This thesis presents a search algorithm RBCT based on a CluTree structure. CluTree is a cluster of trees with red and black nodes, and each tree in the CluTree may have multiple sub-trees. CluTree is established in nested matching areas in T. The pattern matching process selects a matching path in accordance with the sharing degree, correlation degree and mixed information entropy of each node in a CluTree. Our RBCT algorithm uses dynamic pruning techniques to ensure that groups of trees in a CluTree are dynamically updated and the sharing nodes are released. RBCT traverses all the nodes in a CluTree and finds more occurrences compared to the existing algorithms under the one-off condition.
Keywords/Search Tags:Wildcard, Pattern Matching, One-off Condition, Key Character Location, Segmentation, CluTree
PDF Full Text Request
Related items