Font Size: a A A

NetDAP:(?,?)-approximate Pattern Matching With Length Constraints

Posted on:2021-01-10Degree:MasterType:Thesis
Country:ChinaCandidate:J Q FanFull Text:PDF
GTID:2518306560453544Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Pattern matching(PM)with gap constraint has been applied to compute the support of a pattern in a sequence.It is the core technology of sequential pattern mining and has wide applications in biology,security monitoring and data mining.Traditional PM is divided into exact PM and approximate PM.Due to the large amount of noise interference in the actual data,approximate PM is more flexible and more valuable patterns can be found.Approximate PM with gap constraints mainly adopts the Hamming distance to measure the approximation degree which only reflects the number of different characters between two sequences,but ignores the distance between different characters.Hence,this paper addresses(?,?)approximate PM with length constraints which employs local-global constraints to improve the accuracy of the PM,namely,the maximal distance between two corresponding characters is not allowed to exceed the local threshold?,and the sum of all the?distances must not exceed the global threshold?,It can improve the flexibility of matching and avoid the problem that the approximate gap is too large.The main contents of this paper are as follows:(1)We presented the DAP problem,and some definitions are given.(2)This paper proposes an effective online algorithm,named NetDAP,which employs a special designed data structure named approximate single-leaf Nettree.An approximate single-leaf Nettree can be created by adopting dynamic programming method to determine the range of leaf,the minimal root,the maximal root,the range of nodes for each level,and the range of parents for each node.To improve the performance,two pruning strategies are proposed to prune the nodes and the parent-child relationships which do not satisfy the?and?distance constraints,respectively.(3)On the basis of proving the completeness of the algorithm,the time and space complexity of the algorithm in this paper are given,which are O(g~2*?*n*m~2)and O(m~2*g*?+n)respectively,where g,n,?and m is the maximum gap is the sequence length,is the global approximate threshold and the pattern length,respectively.(4)Finally,extensive experimental results on real protein data sets and time series verify the performance of the proposed algorithm.
Keywords/Search Tags:approximate PM, length constraints, (?,?) distance, single-leaf Nettree
PDF Full Text Request
Related items