Approximation Frequent Pattern Mining In Biological Sequences

Posted on:2018-10-28

Degree:Master

Type:Thesis

Country:China

Candidate:E M Yuan

Full Text:PDF

GTID:2310330542992632

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

With the development of bioinformatics,the implementation and completion of human genes and protein sequencing help to accumulate massive biological data.Mining frequent patterns from biological data facilitates the identification of potential information in biological sequences and the detection of gene and protein homology,which has become an important task in the research field of biological information.In this dissertation,we study the problem of approximate frequent pattern mining with wildcards and gap constraints from biological sequence,with the edit distance or match score of biological character as the measurement of similarity between pattern and sequence.Besides,we define three kinds of approximation operation including insertion,replacement and deletion of characters,which overcome the weakness of the existing approximate frequent patterns mining algorithms which only contain one kind of approximation operation that the replacement of characters.The main work of this dissertation is as follows:?1?We define three kinds of approximation operations including insertion,replacement,and deletion of characters,and some other definitions about approximate pattern matching and mining are also given.In order to decide the pattern is frequent pattern by frequency of patterns we introduce the deduction of pattern's compensation sequence numberN_l,editmaxin the problem of approximate pattern matching.Because of the huge candidate solution space of the problem,we design an Apriori-like pruning strategy to reduce its size.?2?Based on edit distance to measure the similarity between sequences,we design matrix of edit distance A-EDM and it's construction method,which records the smallest edit distance?or error?between pattern's substring and sequence's substring.On the base of A-EDM,we devise a matrix MST of candidate solution set and approximate pattern matching algorithm APM to calculate the number and the position of approximate occurrence of patterns in sequence.Then,we put forward the approximation frequent pattern mining algorithm?MAPA?,in which when edit distance threshold edit_max=0,MAPA algorithm turns into an exact frequent patterns mining algorithm.?3?For the specificity of the frequent pattern mining in biological sequences,we design approximate match score matrix MSM between pattern and sequence by combining edit distance constraints with score matrix of biological characters,and MSM records the maximum matching score between pattern's substring and sequence's substring.Based on MSM,we design an approximate pattern matching algorithm S-APM for biological sequence,which apply the strategy of back-track in MSM to calculate pattern's approximation occurrences in sequence.Afterward,we propose approximate frequent patterns mining algorithm?MAPS?and multiple sequence common frequent patterns mining algorithm?co-fp-miner?for biological sequences.The algorithm of co-fp-miner improves the Apriori-like pruning rule and devises the pruning strategy,which achieves better performance.Experiments validated our algorithms'effectiveness and the advantage in solutions compared with the classical exact frequent pattern mining algorithm?MPP?and the approximate frequent pattern mining algorithm?ArpGap?.

Keywords/Search Tags:

pattern match, pattern mining, frequent pattern, wildcards gap constraints, approximate operations

PDF Full Text Request

Related items

1	Research On Frequent Pattern Mining With Big Social Graph
2	Research On Distributed Frequent Pattern Mining From Vehicle GPS Trajectories
3	Negative Sequence Pattern Mining Technology And Application Of Biological Information
4	Mining Content Pattern And Constructing Tupu Of Chinese Provincial Comprehensive Atlas
5	Using Regression Methods To Estimate The Length Of The Longest Frequent Patterns With Periodic General Gap Constraints
6	Research On Distributed Frequent Graph Pattern Mining Method Based On Simulation Matching
7	Research On Organelle Communication Patterns Based On Frequent Subgraph Mining
8	On The Spectral Properties Of Matrix Patterns
9	Study Of Complex Pattern Mining In Time Series Data Stream
10	Analysis Of Land Use Landscape Pattern And Construction Of Ecological Security Pattern In Mining Cities