Font Size: a A A

Approximation Frequent Pattern Mining In Biological Sequences

Posted on:2018-10-28Degree:MasterType:Thesis
Country:ChinaCandidate:E M YuanFull Text:PDF
GTID:2310330542992632Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development of bioinformatics,the implementation and completion of human genes and protein sequencing help to accumulate massive biological data.Mining frequent patterns from biological data facilitates the identification of potential information in biological sequences and the detection of gene and protein homology,which has become an important task in the research field of biological information.In this dissertation,we study the problem of approximate frequent pattern mining with wildcards and gap constraints from biological sequence,with the edit distance or match score of biological character as the measurement of similarity between pattern and sequence.Besides,we define three kinds of approximation operation including insertion,replacement and deletion of characters,which overcome the weakness of the existing approximate frequent patterns mining algorithms which only contain one kind of approximation operation that the replacement of characters.The main work of this dissertation is as follows:?1?We define three kinds of approximation operations including insertion,replacement,and deletion of characters,and some other definitions about approximate pattern matching and mining are also given.In order to decide the pattern is frequent pattern by frequency of patterns we introduce the deduction of pattern's compensation sequence numberNl,editmaxin the problem of approximate pattern matching.Because of the huge candidate solution space of the problem,we design an Apriori-like pruning strategy to reduce its size.?2?Based on edit distance to measure the similarity between sequences,we design matrix of edit distance A-EDM and it's construction method,which records the smallest edit distance?or error?between pattern's substring and sequence's substring.On the base of A-EDM,we devise a matrix MST of candidate solution set and approximate pattern matching algorithm APM to calculate the number and the position of approximate occurrence of patterns in sequence.Then,we put forward the approximation frequent pattern mining algorithm?MAPA?,in which when edit distance threshold editmax=0,MAPA algorithm turns into an exact frequent patterns mining algorithm.?3?For the specificity of the frequent pattern mining in biological sequences,we design approximate match score matrix MSM between pattern and sequence by combining edit distance constraints with score matrix of biological characters,and MSM records the maximum matching score between pattern's substring and sequence's substring.Based on MSM,we design an approximate pattern matching algorithm S-APM for biological sequence,which apply the strategy of back-track in MSM to calculate pattern's approximation occurrences in sequence.Afterward,we propose approximate frequent patterns mining algorithm?MAPS?and multiple sequence common frequent patterns mining algorithm?co-fp-miner?for biological sequences.The algorithm of co-fp-miner improves the Apriori-like pruning rule and devises the pruning strategy,which achieves better performance.Experiments validated our algorithms'effectiveness and the advantage in solutions compared with the classical exact frequent pattern mining algorithm?MPP?and the approximate frequent pattern mining algorithm?ArpGap?.
Keywords/Search Tags:pattern match, pattern mining, frequent pattern, wildcards gap constraints, approximate operations
PDF Full Text Request
Related items