Font Size: a A A

Research On The Sequential Pattern Mining Algorithms Using Prefix-tree Structure

Posted on:2014-10-15Degree:DoctorType:Dissertation
Country:ChinaCandidate:PHAM THI THIETFull Text:PDF
GTID:1268330401473955Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Together with the rapid development of computer and internet technology, the hugeamounts of data have been gathered together from various kinds of applications become moreenormous and have far exceeded our human power for apprehension without powerful tools.They have been described as a data rich but information poor situation. Therefore, datamining with the aim of finding the valuable information and necessary knowledge hidden in avast amount of data has become one of the most important tasks in the field of data miningresearch. The variety and richness of data have formed different data kinds include transactiondata, sequence data, stream data, time-series data and so on.Sequence data is an important type of data which occurs frequently in many applications.It is composed of sequences of ordering elements or events, listed with or without a specificnotion of time. Although there is the existence of a lot of general data mining methods toother kinds of data but for sequence data, these methods could not be applied because ofamong all kinds of data, sequence data has its own unique sequence features and can beexisted in many interesting applications which leads to many interesting new kinds ofknowledge to be discovered including sequential patterns, approximate biological sequencepatterns, partially ordered patterns, periodic patterns, motifs, and so on; and these kinds ofpatterns will assist the development of new classification, clustering and outlier analysismethods, which in turn call for new, the development of different application kinds.The sequential pattern mining is one of important tasks of data mining research and oftenused popular in sequence data mining applications. The process of sequential pattern miningis to extract frequent subsequences in a sequence database. This work has also attracted muchmore attention to researchers in data mining research. Many works has been examined onmining sequential patterns, however, the main challenges still exist as large search spaces andthe ineffectiveness in handling dense datasets. To resolve the above challenges, the problemsfor mining closed sequential pattern, sequential generator pattern, and sequential rules havebeen proposed. In this thesis, we have proposed novel algorithms to address these problemswith the following two main objectives:Exploitation of secondary information as sequential pattern, closed sequential pattern,sequential generator pattern based on the corresponding prefix-tree structures.Generate the kinds of sequential rules based on the secondary information in theprefix-tree structure. In this thesis, we have four mainly contributions which can be briefly described asfollows:Firstly, this thesis mentions several interestingness measures as Lift, Conviction,Piatetsky-Shapiro, Cosine, Jaccard and so on, which have proposed for mining associationrules and classification rules but they have not been applied to mine sequential rules insequence databases except the traditional measures of rule such as the support and confidence.We also propose then an efficient algorithm to generate all relevant sequential rules with theabove interestingness measures from the prefix-tree which stored the whole sequential patternwhere each child node stores a sequential pattern and its corresponding support value. Bytraversing the prefix-tree, the algorithm can then easily identify the components of a rule, andcan calculate the measured values of the rule. The experimental results show that sequentialrule mining with interestingness measures using the proposed algorithm based on theprefix-tree was always much faster than that using the other existing algorithm as modifiedFull. Especially when mining in large sequence databases with the low minimum supportvalues, the number of sequential patterns generated from sequence databases was large andthe proposed algorithm outperformed much because the proposed algorithm only traverse theprefix-tree to immediately determine which sequences are the left-and right-hand sides of arule as well as their support values to compute the interestingness measure values of the rulefrom the sequential pattern set. In addition, the experimental results also show that the timefor mining sequential rules with the confidence measure was the smallest, because it did notneed to revisit the prefix-tree to determine the support of Y (the antecedence of rules), whilethe other interestingness measures need to revisit the prefix-tree to determine the supportvalues of the consequent of rules or both the antecedence and the consequent.Secondly, in this thesis, the characteristics of sequential generator patterns arecombined with the extension of a sequence on the prefix-tree to propose two efficientalgorithms, called MSGPs and MSGP_PreTree, for finding all the sequential generatorpatterns at the same time of the generating sequential patterns. Using the prefix-tree, newsequences, which are child nodes, can be easily created by appending an item to the lastposition of a parent node as an itemset extension or a sequence extension. The proposedalgorithms use the prime block encoding approach to represent candidate sequences and usesjoin operations over the prime blocks to determine the frequency for each candidate. In theMSGPs algorithm, it uses a hash table to store sequential generator patterns with the hash keyas the support of the pattern for fast checking. MSGP_PreTree algorithm that is improvedfrom the MSGPs algorithm, to generate all sequential generator patterns. The idea of theimproved algorithm is performed by modifying the prefix-tree such that each node on the prefix-tree will be added fields to check whether the sequence stored in this node is asequential generator pattern or not. The whole information of the sequence is stored on theprefix-tree, so the MSGP_PreTree algorithm does not need to use a hash table to storesequential generator patterns, which reduce signifcantly the use of memory. Thesupersequence frequency-based pruning and the non-generator-based pruning on theprefix-tree are applied in the MSGP_PreTree algorithm to reduce the search space. Theprocess of extending prefix-tree and determining sequential pattern in the MSGP_PreTreealgorithm is performed similar to the MSGPs algorithm. All the experimental results forsynthetic and real databases show that the number of sequential generator pattern is alwayssmaller than the number of sequential patterns, and in all cases the proposed algorithmsoutperform the other algorithm in terms of running time.Thirdly, we propose an efficient algorithm for directly finding both closed sequentialpatterns and their sequential generator patterns in the generating sequential patterns processcalled CloGen algorithm (Closed sequential pattern-sequential Generator pattern), which isbased on the combination of the child-parent relationship on prefix-tree structure and thedefinition of closed sequential pattern and sequential generator pattern. Each node on theprefix-tree in our approach stores a sequential pattern and its corresponding support value.Besides, it will be added one field (IsmSGP) to consider whether this node is a minimalsequential generator pattern, and another field (IsCSP) to consider whether this node is aclosed sequential pattern. Based on these fields added to each node, the algorithm easilydetermines if the sequence at each node is a minimal sequential generator pattern or closedsequential pattern, so the mining time is reduced significantly. This algorithm also uses joinoperations over the prime block encoding approach of the prime factorization theory torepresent candidate sequences and determine the frequency for each candidate. Experimentalresults show that the performance runtime for mining closed sequential patterns and theirminimal sequential generator patterns using the CloGen algorithm is much faster than oneorder of magnitude. The CloGen algorithm can generate all sequential patterns, sequentialgenerator patterns, and closed sequential patterns at the same time. Furthermore, the builtprefix-tree in the our approach will be one of the most efficient prefix-trees for miningnon-redundant sequential rules in the future and also for mining all sequential rules.Fourthly, an efficient algorithm called MNSR-Pretree for mining non-redundantsequential rules is proposed in this thesis. The proposed algorithm is decomposed two phases.In the first phase, it builds a prefix-tree that stores all the sequential patterns from a givensequence database. Then in the second phase, it mines non-redundant sequential rules fromthis prefix-tree. In the prefix-tree building process, each node on the prefix-tree has a field (IsmSGP) that indicates whether this node is a minimal sequential generator pattern, andanother field (IsCSP) that indicates whether this node is a closed sequential pattern, which isperformed by the CloGen algorithm in the previous contribution. By traversing the prefix-tree,non-redundant sequential rules can be easily mined from a minimal sequential generatorpattern X to a closed sequential pattern Y such that X is a prefix of Y, which greatly reducesthe mining time required. Based on the values of IsmSGP and IsCSP, the MNSR-Pretreealgorithm only mines rules from a parent node whose IsmSGP value is true to children nodeswhose IsCSP value is true, so that the sequence at the parent node is considered as anantecedent of the rules to be generated, and the consequents of rules are generated byremoving the prefix part, which the sequence at the parent node has, from closed sequentialpatterns. The experimental results on synthetic and real databases show that the number ofnon-redundant sequential rules is much smaller than that of sequential rules, and that the timerequired for mining non-redundant sequential rules is much less than that required for miningsequential rules. Besides, the results also show that the time required for miningnon-redundant sequential rules of the proposed algorithm is less than that required by anexisting algorithm.In summary, in this thesis we have proposed the efficient algorithms and also completedthe initial introduced objective is that "To improve the efficiency of the exploitation ofsecondary information algorithms as sequential pattern, closed sequential pattern, sequentialgenerator pattern based on the prefix-tree structure" with the main contribution is "the use ofthe prefix-tree in order to generate significantly the kinds of sequential rules as sequentialrules with interestingness measures and non-redundant sequential rules from the secondaryinformation". The goal of this thesis has been achieved by using the child-parent relationshipon the prefix-tree structure and the extension of sequences to propose novel algorithms formining works related to sequential patterns in the sequence database including algorithms formining sequential rules with interestingness measures, mining sequential generator patterns,mining closed sequential patterns and their sequential generator patterns and miningnon-redundant sequential rules. The above proposed methods can be evaluated with bothsynthetic and real datasets. Experimental results illustrate the effectiveness and efficiency ofour algorithms, which improved significantly the efficiency.
Keywords/Search Tags:Sequential pattern, closed sequential pattern, sequential generator pattern, interestingness measure, sequential rule, non-redundant sequential rule, prefix-tree
PDF Full Text Request
Related items