Font Size: a A A

Nonoverlapping Closed Sequential Pattern Mining

Posted on:2021-03-10Degree:MasterType:Thesis
Country:ChinaCandidate:C R ZhuFull Text:PDF
GTID:2518306560953039Subject:Computer technology
Abstract/Summary:PDF Full Text Request
Sequential pattern mining(SPM)is a kind of data mining method which finds frequent subsequences(patterns)whose occurrence times(support)are more than given threshold in a single sequence or sequence database.This method has been applied in many fields,which are closed to many industries in our life.The traditional sequential pattern mining mainly aims at mining the complete set of subsequences.In fact,there exist a large number of redundant patterns in the complete set of frequent patterns,which makes a huge pattern size,resulting in the algorithm time consumption and space overhead.In order to overcome this problem,closed sequential pattern mining strategy is introduced into this paper,to mine the closed pattern subset of the sequence whose support is not equal to its super pattern.In addition,this paper uses nonoverlapping condition that allows different characters of pattern to use the same sequence character,reducing the generation of redundant pattern.The research of this paper shows that the nonoverlapping closed sequence pattern mining has high research value and application value.The research work of this paper are as follows:1.In this paper,we give the definition of the nonoverlapping closed pattern mining,make an analyzation on the existing related algorithms,and prove that the nonoverlapping closed pattern mining algorithm can get complete solution.With pattern growth strategy to generate candidate patterns,this paper proposes a nonoverlapping closed pattern mining algorithm NetNCSP(Nettree for Nonoverlapping Closed Sequential Pattern).It is proved theoretically that NetNCSP is a complete algorithm satisfying Apriori property,the analysis of time complexity,space complexity and Apriori property are given.2.There are two main steps of NetNCSP,support calculation and closeness determine.In the procedure of support calculation,Back Tr is proposed based on the Nettree structure to reduce the complexity from O(m*m*n*w)to O(m*n*w),and it is proved that Back Tr is a complete algorithm.In the procedure of closeness determine,three strategies,Inherit,Predict and Closeness determine,are proposed to determine the closeness of patterns and prune the redundant patterns.3.In order to verify the efficiency of NetNCSP,this paper proposes 6 comparative algorithms,in which,NetNCSP-bf ? NetNCSP-df generate candidate patterns using breadth-first and depth-first strategies respectively;NetNCSP-netgap applies NETGAP strategy to calculate the pattern support;NetNCSP-noinh and NetNCSP-nocheck are proposed to determine the closed patterns without Inherit,Predict and Closeness determine,respectively;NetNCSP-nogap mine the continuous closed patterns without gap constraints.4.The comparative experiment results in DNA,virus sequence and other long sequence data show that,in the comparation of the similar algorithms,NetNCSP has the best mining performance and good pattern compression with different sequence database,pattern length,support threshold,gap constraint.Application experiments on the novel coronavirus pneumonia(COVID-19)virus SARS-Co V-2 sequence show that,the SARS-Co V-2 and SARS have the same pattern composition,but there are some differences in pattern combinations.
Keywords/Search Tags:Sequential pattern mining, Closed pattern mining, Periodic wildcard gaps, Nettree, COVID-19
PDF Full Text Request
Related items