Font Size: a A A

Mining Approximate Frequent Patterns With Wildcards And Gap Lengths

Posted on:2015-04-27Degree:MasterType:Thesis
Country:ChinaCandidate:T N XiangFull Text:PDF
GTID:2308330473959339Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Along with the rapid development of bio-informatics, network intrusion detection and text retrieval, how to discover meaningful patterns from sequences has become an important problem. Among all types of patterns defined in the literature, the most challenging one is to find repeating patterns with wildcards and gap-length constraints.During the pattern matching and mining processes, allowing patterns to appear in the subject sequence with editing errors would make the problem closer to the needs of practical applications, especially in bioinformatics. In this thesis, we conduct a study on mining approximate repeating patterns with gap-length constraints, where the appearance of a pattern is subject to an approximate match. Users can specify the range of wildcards between each pair of pattern characters, and the allowed editing errors. This study is expected to not only improve pattern matching and mining, but also bring practical application potentials to several fields.The main contributions of this dissertation are as follows.(1) As with important parameters in traditional sequence matching and mining problems, the character distribution on text and pattern features can help to reveal the problem solving complexity. Based on these features, an expectation model E(Q) = n*D*π(P) is proposed, where n is the length of text T, D is the product of all gap; in pattern P, and π(P) is the character distribution probability of P in T. Experiments on artificial random texts and DNA sequences show that the model prediction error rate is 1.8%-3.2% and 4.7%-7.8% respectively. This model reveals the influence of text length, pattern length and the length of wildcards on Ω in different character distributions. This expectation model E(Ω) can estimate Ω easily in large texts and can be used in the study of the pattern mining problem.(2) To solve the problem of mining approximate patterns, we propose an MARP (Mining Approximate Repeating Patterns with wildcards and gap lengths) algorithm with two major novel components. First, the support ratio is used to measure the pattern’s frequency. In order to do this, we provide a formula to calculate the number of approximate off-set sequences, and an Apriori-like deterministic pruning approach is proposed to progressively prune patterns and stop the search process if necessary. Thus, the efficiency of the mining process is improved. Second, a pattern search process to discover approximate occurrences of a pattern under user specified gap-length constraints is proposed. This method is based on an improved dynamic programming matrix, and takes into account the insertion, deletion and replacement operations simultaneously. Since this method can effectively calculate the approximate occurrences of a pattern, the MARP algorithm can be more flexible for mining approximate frequent patterns.
Keywords/Search Tags:wildcard, gap-length constraints, expectation model, frequenct pattern, approximate pattern mining
PDF Full Text Request
Related items