Font Size: a A A

Multi-pattern Matching With Wildcards Based On Suffix Tree And Suffix Array

Posted on:2021-04-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:N LiuFull Text:PDF
GTID:1368330614459933Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
At present,the problem of pattern matching has important application value in many fields,such as information retrieval,text mining,network security,and bioinformatics,in the era of Big Data.Compared with regular expressions and single pattern matching,the problem of multi-pattern matching with wildcards can solve more complex pattern matching problems.Therefore,according to the characteristics of patterns with wildcards,with the help of efficient data structures such as suffix trees,suffix arrays,and their corresponding characteristics,the study of multi-pattern approximate matching with wildcards has important research significance and application value.Existing studies mainly focus on single and exact pattern matching,and there is little research on the problem of multi-pattern approximate matching with wildcards.Suffix trees and suffix arrays are efficient in exact string matching.They are often used to find frequent substrings,longest repeating substrings,longest common prefixes,and palindromes in a string.Most of them are about approximate matching with theoretical analysis,and generally lack of experimental analysis and demonstration.Therefore,the use of suffix trees and suffix arrays for approximate multi-pattern matching with wildcards has important research significance.Based on the analysis and review of existing theoretical research on pattern matching with wildcards,multi-pattern approximate matching,suffix trees and suffix arrays in approximate matching within China and abroad,this dissertation proposes three new algorithms based on the suffix tree and suffix array structures to solve the problem of multi-pattern matching with wildcards.The main contributions and innovations of this dissertation are summarized as follows:(1)A study on the application of suffix trees in approximate pattern matching.Given a pattern P consisting of exact characters and wildcards,the task is to find the number of occurrences and positions of the pattern in an object sequence S.This problem belongs to approximate pattern matching.Suffix trees are mainly used in exact pattern matching and approximate pattern matching for theoretical analysis.Combining the characteristics of suffix trees,this dissertation applies two methods to realize suffix trees in approximate single pattern matching with wildcards.Through algorithm design and comparative experimental analysis,it is concluded that suffix trees can be applied to approximate pattern matching with wildcards.Because it costs extra time to build suffix trees for single pattern matching,this approach has certain limitations.(2)Two multi-pattern matching with wildcards algorithms based on suffix trees are proposed.According to the previous research results,it is found that suffix trees can be applied to approximate pattern matching with wildcards.Combining the fact that a suffix tree is only built once and used many times,two algorithms MMST-S and MMST-L are designed based on the dynamic programming and edit distance methods.These two algorithms are selected according to the length of the exact characters in the pattern.Comparative experiments are designed and conducted on real bioinformatics data.The experimental results show that suffix trees can be applied to approximate multi-pattern matching.When the objected sequence is unchanged,the more the number of matching patterns,the more efficient the MMST algorithms are compared with other comparison algorithms.(3)Two multi-pattern matching algorithms based on suffix arrays(MMSAs)are proposed.A suffix array is a more efficient data structure that is optimized on the basis of a suffix tree.Compared to a suffix tree,a suffix array sorts all suffixes lexicographically and stores them in an array.Although sorting takes some time,sorted suffixes,combined with other properties of suffix arrays,can more effectively exclude more impossible positions.Through experimental demonstration in DNA sequences and protein sequences,it is found that,when the objected sequence is unchanged,compared with other control algorithms,the MMSA algorithms are more efficient as the number of patterns increases,and is better than the algorithms based on suffix trees under certain conditions.
Keywords/Search Tags:Multi-pattern Matching, Approximate Matching, Wildcard, Suffix Tree, Suffix Array
PDF Full Text Request
Related items