Font Size: a A A

An Efficient Algorithm With Incremental Data Mining For Mining Sequential Pattern(NPSP)

Posted on:2004-05-24Degree:MasterType:Thesis
Country:ChinaCandidate:B ZhangFull Text:PDF
GTID:2168360122460460Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Recently, our capabilities of both generating and collecting data have been increasing rapidly. The computerization of many business and government transactions, and the advances in data collection tools have provided us with huge amounts of data, millions of databases have been used in business management, government administration, scientific and engineering data management, and many other applications. It is noted that the mumber of such databases keeps growing rapidly because of the availability of powerful and affordable database systems. This explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Consequently, data mining has become a research area with increasing importance.The problem of sequential pattern is an important branch of data mining. The concept of sequential pattern is introduced to capture typical behaviours over time, i.e. behaviours sufficiently repeated by individuals to be relevant for the decision maker. If we are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequence pattern is "5% of customers bought 'Foundation' and 'Ringworld' in one transaction, followed by 'Second Foundation' in a later transaction". We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we allow the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy(is a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy.The GSP algorithm and the PSP algorithm is the main two algorithms.They both scale linearly with the number of data-sequences, and have very good scale-up properties with respect to the average data-sequence size, but they both do not have the function of incremental data mining. In this paper, we present an approach, called NPSP(New Perfectly Sequential Pattern),for mining sequential patterns embedded in a database. Different from the algorithm of GSP or PSP, the NPSP algorithm can mining sequential patterns over a database of sequences with increment.In the NPSP algorithm, we use a new data structure and we name it "Heterogeneity Tree". Then we discuss the algorithm in detail. We experimented on the incremental function of the NPSP algorithm using several synthetic data, and empirical evaluation indicates that the incremental idea of the NPSP algorithm is right and is much faster than the normal mining(not using the incremental function). At the same time, NPSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.
Keywords/Search Tags:Data Mining, Sequence Patterns, Heterogeneity Tree, the NPSP algorithm, Incremental Data Mining
PDF Full Text Request
Related items