Font Size: a A A

Research On Sequential Pattern Mining Algorithms

Posted on:2006-10-26Degree:MasterType:Thesis
Country:ChinaCandidate:E Z GuanFull Text:PDF
GTID:2168360155952937Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data mining which is also named as "Knowledge Discovery in Database —KDD"is an advanced process of extracting or "mining"believable, novel, effective and understandable knowdege from large amounts of data. Sequential pattern mining is the mining of frequently occurring patterns related to time or other sequences. Sequential pattern mining is a very important field in data mining with broad applications, including the analysis of market baskets, Web access patterns, disease treatments, natural disasters, DNA sequences, and so on. Many studies have contributed to the efficient mining of sequential patterns. However, traditional methods on mining sequential patterns in sequence database may be involved in the problem that the number of frequent subsequences will increase explosively as frequent patterns become longer or the problem that the computational complexity of such algorithms may become very expensive, when sequential patterns are long or the minimum support becomes low. This limits the effectiveness of them, since useful longer frequent patterns have not yet been discovered. And in many applications, it is not unusual that one may encounter databases with large number of sequential patterns and long sequences, such as in DNA analysis or stock sequence analysis. To solve those problemes, in this paper, three novel algorithms, EFSPAN (Effective Frequent Sequential PAtterN mining algorithm) and FFSPAN (Fast Frequent Sequential PAtterN mining algorithm) and MFSPAN (Maximal Frequent Sequential PAtterN mining algorithm) are proposed. The search strategy of EFSPAN integrates a depth-first traversal of the prefix sequence lattice with two effective pruning mechanisms. Experiments show that EFSPAN can avoid searching more than 60% of nodes in the search space when patterns are long and minimum support is low, which minimizes the search space greatly and decreases the high computational complexity. Certainly, most of the skipped nodes are located in deep levels of the search space. So it is very meaningful. The algorithm FFSPAN decreases the high computational complexity in another way. Traditionally, to judge whether a sub-sequence is frequent in a database, one need to compare the whole sub-sequence with every sequence in the original database, however, the algorithm FFSPAN succeeds in solving the problem that in a sequence database, instead of searching a whole frequent sequence, we only need to search a frequent item or a frequent itemset. Moreover, the databases that FFSPAN scans keep shrinking greatly, which makes the algorithm much more efficient when the sequential patterns become long. Experiments on standard test data show that FFSPAN is very effective when the minimum support is low. FFSPAN also demonstrates a good scalability, since it succeeds in completing the mining process with larger database. The idea in FFSPAN can be integrated into most sequence mining algorithms with a little additional cost, since the search space pruning method does not modify the underlying frequent pattern mining algorithm, and it only changes what algorithms are to mine in each step and the size of the database to be scanned. So it proposed a novel way to mine sequential patterns. Sequential pattern mining is a very important field in data mining, but when frequent patterns are long, such mining will generate an exponential number of frequent subsequences, which makes the set of all frequent sequential patterns too large. Some researchers proposed to mine the set of all closed patterns; a frequent set is closed if it has no superset with the same frequency [5]. But for some dense databases, even such a set would grow too large. However, the set of maximal frequent sequences (MFS) is much smaller than the two kinds of sets above. Given an MFS set, it is much easier to obtain some useful information, (e.g. the longest frequent sequence, the overlap of MFS, etc.), and a decision-maker may not be perplexed by too much repetition of redundant information. Moreover, all the frequent sequences can be built up from an MFS set, and if we are interested in a certain sequence, we can get its support by searching the database in a single scan. All of these make mining MFS very meaningful. But most recent algorithms to mine maximal patterns such as Mafia [3], MaxMiner [8], and DepthProject [9], are about mining maximal frequent itemsets in transaction databases, and few works have been done on mining maximal frequent sequences in sequence databases. The novel algorithm MFSPAN (Maximal Frequent Sequential PAtterN mining algorithm) to mine the complete set of maximal frequent sequential patterns in sequence databases is proposed to solve this problem. MFSPAN takes full advantage of the property that different sequences may share a common prefix to reduce itemset comparing...
Keywords/Search Tags:Sequential
PDF Full Text Request
Related items