Font Size: a A A

Models To Solve Pattern Matching With Bounded Length Gaps And The One-off Constraint

Posted on:2016-01-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:H P WangFull Text:PDF
GTID:1108330488993386Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Nowadays, it has been widely paid attention on the introduction of wildcard as a special character to match any character in an alphabet in pattern recognition issue, motivated by advances in computational biology and information retrieval fields. However, a great deal of studies on these fields have indicated that multiple patterns frequently appear simultaneously in a paragraph will manifest a new model essentially. For an instance, at the beginning of introns of DNA sequence, the TATA box often appears at the downstream of CAATCT sequence with the 30-50 bp distance. In order to improve the sequence-specific search, this pattern can be indicated as "...CAATCT [30-50]TATA...".Aforementioned model can be described as a sequence set of sub-patterns, where any two adjacent sub-patterns with flexible gaps within a restricted range of the text. Thus, the wildcard model is extended to indicate a string with an bounded length instead of an individual character. And we call this wildcard model as pattern matching with bounded length gaps. We also consider to introduce the one-off constraint to ensure the stability of the model, which can avoid the affects of extraordinary matching frequency of individual sub-patterns. Thus, our study mainly focus on the Pattern Matching with bounded length Gaps and the One-off constraint problem (PMGO). Our main contributions are as follows:(1) For PMGO problem we draw on the experience of constraint satisfaction problems (CSPs) framework and set up a 3-tuple model consisting of variables, domains and constraints. We then derive formal descriptions for the concepts of the bounded length gap, the one-off constraint, and the solution space. We also illustrate the basic properties of PMGO, including the completeness under certain conditions. Otherwise, a one-to-many solution structure is produced with the bounded length gaps, thus drawing on the 3-tuple model we illustrate that search under the one-off constraint in the solution space enables combinatorial optimization;(2) We derive a heuristic algorithm for PMGO problem. We first introduce the concept of a boundary in the solution space, and propose a partitioning algorithm SPLIT to locate the boundary without changing the solution structure, thus dividing PMGO problem into several independent sub-problems. We then analyze the partitioning completeness. The experimental results demonstrate that SPLIT can efficiently reduce the time consumption of non-linear matching algorithms. Based on the 3-tuple model, we then transform the independent sub-problems into several path search problems in a directed acyclic graph (DAG) structure, where a set of all top-down paths in DAG is the candidate space, and a subset of the paths that are independent each other corresponds to a solution. Then a graph-based pruning and matching (GPM) algorithm is presented. GPM algorithm builds a constraint relationship between vertexes under the one-off constraint, then combines the path search with a pruning procedure in an alternating iterative manner. We finally illustrate that the output of GPM algorithm is the solution of PMGO problem. Experiments results demonstrate that the GPM algorithm provides a complementary method to heuristic algorithms, which reduces the loss rate of occurrences;(3) We give a completeness analysis and a heuristic algorithm for PMGO problem under certain conditions. First, we introduce the concept of the pattern character repetitions as a pattern feature. On the one hand, we prove that the completeness of PMGO problem can be guaranteed when the pattern does not have repeat characters. This conclusion applies to the GPM algorithm and the other greedy algorithms. Experiments demonstrate that the impact of the pattern repetitions on PMGO problem using an approximation ratio metric. On the other hand, we present a dynamic pruning algorithm for patterns with repeat characters in tail. Experiments demonstrate that this algorithm can efficiently improve the number of matching occurrences.
Keywords/Search Tags:Bounded Length Gaps, One-off Constraint, Solving Method, Pattern Matching, Wildcards
PDF Full Text Request
Related items