
Research On Contrast Sequential Pattern Mining Based On Subsequence Distribution Variation

Posted on: 2020-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: Q Li
GTID: 2428330620951118
Subject: Computer Science and Technology
Abstract/Summary:
Contrast sequential pattern mining is an important research task in data mining that aims to discover the differences between classes of sequence data. How to efficiently mine meaningful and easy-to-analyze patterns from sequence data is a major open problem in current research. Researchers have designed many algorithms for mining contrast sequential patterns. However, most of these algorithms are built on occurrence counts or support frameworks and ignore the effect of subsequence distribution on patterns. Although an existing algorithm for emerging sequential pattern mining does consider the location information of subsequences, it uses a fixed location to identify distribution differences between classes of sequence data, i.e., it finds subsequence patterns that appear before a given distinguishing location in one sequence dataset and after the same location in another. Without sufficient prior knowledge, it is difficult for users to set an appropriate location threshold. Moreover, since the distinguishing location differs from one subsequence to another, a fixed location threshold may miss many meaningful patterns. Because a large amount of sequence data carries time tags, the time attribute is also a non-negligible factor in the analysis of sequence data. An algorithm that can automatically analyze differences in the time distribution of events would help decision makers make the right decisions. In addition, with the generation of large amounts of biological data, methods that can automatically analyze the differences between classes of biological sequences are urgently needed. However, previous studies on contrast sequential pattern mining did not consider the effect of the spatial location distribution of genes/amino acids within biological sequences. In response to these problems, the main contributions of this dissertation are as follows:

(1) A contrast sequential pattern mining method based on subsequence time distribution variation and satisfying a discreteness constraint is proposed. Based on the design of the suffix tree, the algorithm first maps all the suffix substrings generated by each sequence in the dataset to the paths of a tree. Each node of this tree stores the time information and the count of its item. Each node is then visited by depth-first search to mine the patterns that satisfy the corresponding conditions. In addition, a discreteness constraint on the time series is proposed to ensure the compactness of the subsequence time distribution. Experimental results on user behavior datasets and online retail datasets show that the proposed algorithm can mine more meaningful patterns and has better classification performance.

(2) A method for mining contrast sequential patterns from biological sequences based on the spatial location distribution of subsequences is proposed. The algorithm maps each instance in the dataset and all of its suffix substrings to paths of the tree and mines patterns satisfying the corresponding conditions in a depth-first manner. Unlike the contrast pattern tree based on subsequence time distribution variation, each node stores the location information and the count of its item, and the performance of the pattern tree is further optimized. Experimental results show that the proposed patterns are meaningful for mining biological sequences, and that using them as classification features can improve classification performance.
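The tree-based mining idea summarized in the abstract can be illustrated with a small Python sketch. This is not the thesis's algorithm. It is a minimal illustration under several assumptions: each node records the timestamp of the last item on its path, the "discreteness constraint" is modeled as an upper bound on the population standard deviation of those timestamps within each class, and a pattern counts as a contrast pattern when its mean time differs between the two classes by at least a threshold. The names `Node`, `build_tree`, `mine`, `min_sup`, `min_shift`, `max_spread`, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


class Node:
    """One item on a path of the suffix-style pattern tree."""

    def __init__(self):
        self.children = {}              # item -> Node
        self.times = defaultdict(list)  # class label -> timestamps of this item
        self.seen = defaultdict(set)    # class label -> ids of supporting sequences


def build_tree(datasets, max_len=3):
    """Insert every suffix (truncated to max_len items) of every sequence.

    datasets maps a class label to a list of sequences; a sequence is a
    list of (item, timestamp) pairs.
    """
    root = Node()
    for label, sequences in datasets.items():
        for seq_id, seq in enumerate(sequences):
            for start in range(len(seq)):  # one suffix per start position
                node = root
                for item, t in seq[start:start + max_len]:
                    node = node.children.setdefault(item, Node())
                    node.times[label].append(t)
                    node.seen[label].add(seq_id)
    return root


def mine(root, pos, neg, min_sup=2, min_shift=2.0, max_spread=1.5):
    """Depth-first search for patterns whose time distribution shifts between classes."""
    results = []

    def dfs(node, prefix):
        for item, child in node.children.items():
            pattern = prefix + (item,)
            # Support of a contiguous pattern cannot grow when it is extended,
            # so an unsupported node prunes its whole subtree.
            if len(child.seen[pos]) < min_sup or len(child.seen[neg]) < min_sup:
                continue
            tp, tn = child.times[pos], child.times[neg]
            shifted = abs(mean(tp) - mean(tn)) >= min_shift
            # Assumed stand-in for the discreteness constraint: times must be compact.
            compact = pstdev(tp) <= max_spread and pstdev(tn) <= max_spread
            if shifted and compact:
                results.append((pattern, mean(tp), mean(tn)))
            dfs(child, pattern)

    dfs(root, ())
    return results


# Toy data (hypothetical): two classes of timestamped event sequences.
datasets = {
    "pos": [[("a", 1), ("b", 2), ("c", 3)], [("a", 1), ("b", 3), ("c", 4)]],
    "neg": [[("c", 1), ("a", 6), ("b", 7)], [("a", 5), ("b", 8), ("c", 9)]],
}
found = mine(build_tree(datasets), "pos", "neg")
for pattern, t_pos, t_neg in sorted(found):
    print(pattern, t_pos, t_neg)
```

For the biological-sequence variant described in contribution (2), the same structure would apply with the position of each item in its sequence stored in place of the timestamp, again as an assumption about how the location information is recorded.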
Keywords/Search Tags:Contrast sequential pattern, Subsequence distribution variation, Discreteness constraint, Classification