Font Size: a A A

Efficient Complex Pattern Matching Approaches On Large-scale Text Collections

Posted on:2020-09-28Degree:DoctorType:Dissertation
Country:ChinaCandidate:T QiuFull Text:PDF
GTID:1488306353963129Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
As the computer technology has been widely used in many applications,the processed data is also grow rapidly.The tremendous volume of data brings the new challenge for the query processing technique.Meanwhile,the user has a higher demand for the query processing technique and the simple query cannot meet the need of user,so the complex pattern matching shows more significances for data management systems.Complex pattern matching includes several concrete matching problems,such as regular expression matching,similar matching,fuzzy matching and so on.For example,in genomics,regular expression is used to depict the common structures of DNA sequences and Protein sequences,then it is used to find the matching subsequences from the biosequence database.There is also an application of similar subsequence matching in genomics,in order to concatenate the short reads(i.e.,DNA subsequences),a read is mapped to a long reference sequence using the technique of similar subsequence matching.In the area of knowledge base,a regular expression is used to describe the common characteristics of entities,then entities are extracted from the datasets by matching the corresponding regular expressions.Besides,similar subsequence matching is also widely used for data cleaning and integration.Thus,it is valuable to study the efficient complex pattern matching techniques in theory and practice.In this dissertation,we focus on two complex pattern matching problems:regular expression matching and similar subsequence matching.Several efficient indexbased matching algorithms and optimization techniques are proposed.To be more specific,the main contributions of this dissertation are list as follows:Firstly,considering the weak filtering ability of existing filtering-based regular expression matching methods,a novel filtering technique for the regular expression matching problem is proposed,called negative factor.The filtering algorithms using negative factors are developed by analyzing the relationship of negative factor and existing positive filtering factors.To improve the efficiency of filtering,the bit vector based structure BITINDEX is proposed to represents the occurrences of each character using a bit vector,and the bit-parallel filtering algorithms are designed,which utilize the bit vectors of filtering factors computed from BITINDEX.We show the differences of negative factors in filtering ability and query efficiency,and define a set of high-quality negative factors,called core negative factors.Several efficient algorithms are designed to compute the core negative factors from a regular expression query on-the-fly.Besides,the impact of matching direction on the query efficiency is studied,and an adaptive method is proposed to choose a good matching direction which improves the verification efficiency.Secondly,the existing filtering-based regular expression matching algorithms are inefficient to match the complex regular expression queries.To solve this problem,a regular expression matching method is proposed,which can find all matching occurrences purely based on the q-gram inverted index,and avoids the verifications using an automaton on the original data.Hence,this method can be easily integrated into the existing q-gram inverted index based applications without building the new indexes for datasets.A gram-driven NFA is first proposed to represent the language of a regular expression,and the positional constraints on the positions of q-grams required by an occurrence are analyzed using GNFA.The GNFA-based query plan is devised,which computes occurrences by checking the positional constraints through GNFA with the q-gram positions in a positional inverted index.By considering the selectivity of q-grams,a more efficient tree-based query plan is proposed,which optimizes the checking order for positional constraints.Thirdly,we analyze the existing similar subsequence matching algorithms and show that they suffer from the drawbacks of producing large numbers of candidate positions and generating many redundant signatures.To solve these problems,a novel filtering technique(called hybrid signature)is proposed,which enables the signature contains a different number of subsequences from the read query and avoids generating large numbers of redundant signatures.The cost of producing candidate positions using hybrid signatures is analyzed,and a cost-benefit model is designed to generate high-quality hybrid signatures,which achieves a better balance between the filtering ability of signatures and the overhead of producing candidate positions.Besides,the adaptive algorithms are designed to produce candidate positions using hybrid signatures,and the optimizations are proposed to further accelerate the efficiency of producing candidate positions.In summary,for the regular expression matching and similar subsequence matching problems,we propose the new and effective filtering techniques to improve the matching performance,and design the efficient index-based matching algorithms and optimization techniques.Extensive experiments are conducted on various realworld datasets,and the experimental results demonstrate the proposed methods outperform the state-of-the-art comparative methods in the query efficiency.
Keywords/Search Tags:complex pattern matching, regular expression matching, bit-parallel algorithms, similar subsequence matching, inverted index
PDF Full Text Request
Related items