Font Size: a A A

Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application In Simulation System

Posted on:2009-05-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:F J AoFull Text:PDF
GTID:1118360278456545Subject:Control Science and Engineering
Abstract/Summary:PDF Full Text Request
System simulation takes advantages of a lot of technologies from other domains, such as computer technology, network technology, graphics and image processing technology, information processing technology, automatic control technology and so on. It is a significant step for system analysis and research. Data mining technology is a powerful tool for discovering hidden knowledge from simulation data. With the increasing of complexity and scale in simulation system, simulation time becomes much longer and simulation data amount becomes much larger, which suggests simulation data exhibits the characteristics of data streams. So it is necessary to process simulation data with data stream mining technology. A data stream is an ordered sequence of data items with the characteristics of continuity, high data rates, infinity, and the distribution of data items changing with time. Those facts bring tremondous challenges to data-stream mining. Traditional data mining algorithms aiming at static datasets can't be used to mine data streams directly, neither do they have the time and space efficiency. Thus, it is important to research data stream mining algorithms having higher time and space efficiency, and to aim at resolving data mining tasks often used in system simulation.In all data mining tasks, Association rule mining is the one mostly used in simulation. Mining frequent patterns is the key step to generate association rules. So, the research focuses on the key algorithms for mining frequent patterns in data streams. Particularly, three important aspects are researched and implemented, including the algorithms for mining maximal frequent itemsets, closed frequent itemsets, Top-K most-frequent itemsets in data stream, a classification algorithm based on closed frequent itemsets for data stream and a clustering algorithm based on Top-K frequent patterns for high dimensional data stream. Lastly, this paper also discusses how to quickly integrate the data stream mining algorithms into various simulation systems, with the emphasizing on the reusability of the data stream mining algorithms in simulation systems. The main innovative achievements of this research can be summarized as follows:(i) An algorithm for mining maximal frequent itemsets in data stream is proposed. The number of maximal frequent itemsets is much less than that of frequent itemsets or closed frequent itemsets. So, mining maximal frequent itemsets can get higher time and space efficiency. Thus, this paper researches on the technology of mining maximal frequent Itemsets in data stream, aiming at presenting an algorithm that can maintain maximal frequent itemsets in the current sliding window over data streams at any time. The contributions of the research lie in the following three aspects. First, the paper presents a novel pruning technique for mining maximal frequent itemsets in data steam, namely Subset Equivalence Pruning. Second, the paper proposes an one-pass algorithm for mining maximal frequent itemsets, called FPMFI-DS, in which Subset Equivalence Pruning is used to reduce the size of searching space, thereby to improve the efficiency of the algorithm. Lastly, based on FPMFI-DS algorithm, FPMFI-DS+ algorithm is proposed which can mine maximal frequent itemsets in sliding window over data streams in online updating fashion. Experiments show that for some dense datasets, the size of searching space can be reduced by about 40 percent by Subset Equivalence Pruning, FPMFI-DS achieves high performance and good scalability, and FPMFI-DS+ has high updating-mining speed and good stability.(ii) An algorithm for mining closed frequent itemsets in data stream is proposed. The number of closed frequent itemsets is less than that of frequent itemsets, but is more than that of maximal frequent itemsets. Closed frequent itemsets also contain the support of all frequent itemsets. So, mining closed frequent itemsets in data stream is efficient which ensures the perfectibility of information. The paper presents an algorithm for mining closed frequent itemsets, called FPMFI-DS, which can efficiently mine closed frequent itemsets over a stream sliding window with limited memory space, and maintain exact closed frequent itemsets in current window at any time. The experimental results show that the algorithm FPCFI-DS exhibits more tremendous potential than that of the state-of-the-art algorithm Moment in terms of time and space efficiency.(iii) An algorithm for mining Top-K most-frequent itemsets in data stream is proposed. One of the advantages of mining Top-K most-frequent itemsets is that users don't need to specify a minimum support, instead of specifying an integer, k, which is the number of itemsets required. The existing algorithms for mining Top-K most-frequent itemsets have the problems of massive initial items and higher initial border support. To solve these problems, the paper presents an efficient mixed-searching-based algorithm for mining Top-K most-frequent itemsets, MTKFP. The MTKFP algorithm adopts both breadth-first searching and depth-first searching to mine Top-K most-frequent itemsets. Moreover, based on MTKFP algorithm, the paper presents a Chernoff-based algorithm for mining Top-K most-frequent itemsets in data stream, MTKFP-DS. The experimental results show that the number of initial items of MTKFP algorithm is 70 percent lower than that of the existing algorithms, but the initial border support of MTKFP algorithm is higher than that of the existing algorithms, consequently the performance of MTKFP algorithm is superior to that of the best existing algorithm by over one time; results also show that MTKFP-DS algorithm is suitable for mining data streams.(iv) A closed-frequent-itemsets based classification algorithm for classifying data stream is proposed. In contrast to some traditional classification algorithms, the algorithms based on association rules have higher classification precision. These algorithms generally generate classification association rules by frequent itemsets. As mining frequent itemsets often suffers from the problem of combination explosion, the efficiency of algorithm is low. Moreover, the emergence of data streams has posed new challenges to those classification algorithms. To solve these problems, the paper proposes an efficient closed-frequent-itemsets based classification algorithm, CBC-DS, for classifying data stream. In CBC-DS, an efficient one-pass algorithm for mining closed frequent itemsets and an effective method for constructing classifier are designed. The experimental results show that the average precision of CBC-DS is about 1.09 percent higher than that of CMAR algorithm, but CBC-DS is much faster than CMAR.(v) The Top-K-frequent-patterns based clustering algorithms for clustering high dimensional data stream are proposed. Clustering high dimensional data is a more difficult problem in the research domain of Clustering. The synthetical method with density-based clustering and grid-based clustering can be used to solve the problem effectively. Traditional method adopts the procedure of mining frequent itemsets to identify dense units. It has two deficiencies. First, it requires user to specify a minimum density threshold. Second, it identifies dense units with the same minimum density threshold for all subspace, so that the units in sparse subspace can't be identified as dense units. To solve these problems, the paper presents three clustering algorithms for clustering high dimensional data stream. These algorithms base on Top-K most-frequent itemsets, N-most interesting itemsets and Top-K frequent items, respectively, and don't require user to specify a minimum density threshold. The second algorithm is in favor of identifying the dense units in subspace group with specific dimension. The third algorithm is in favor of identifying the dense units in specific subspace. So, the problem of identifying the dense units in sparse subspace is solved. The experimental results show that these algorithms are suitable for clustering high dimensional data stream.(vi) The application of data stream mining technology in simulation system is researched. A simulation application framework based on data stream mining technology is proposed. Moreover, to simplify the process of integrating the data stream mining algorithms into the HLA-architecture-based simulation system, the paper designs universal data stream mining component and general data stream mining federate based on the idea of modularized development so as to improve the reusability of the algorithms. Lastly, the paper introduces the general federate for mining association rules with the example of Missile-Breakthrough simulation system.
Keywords/Search Tags:simulation, data stream mining, association rule, frequent pattern, maximal frequent itemsets, closed frequent itemsets, Top-K most-frequent itemsets, classification based on association rules, clustering high dimensional data stream
PDF Full Text Request
Related items