Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application In Simulation System  Posted on:20090506  Degree:Doctor  Type:Dissertation  Country:China  Candidate:F J Ao  Full Text:PDF  GTID:1118360278456545  Subject:Control Science and Engineering  Abstract/Summary:  PDF Full Text Request  System simulation takes advantages of a lot of technologies from other domains, such as computer technology, network technology, graphics and image processing technology, information processing technology, automatic control technology and so on. It is a significant step for system analysis and research. Data mining technology is a powerful tool for discovering hidden knowledge from simulation data. With the increasing of complexity and scale in simulation system, simulation time becomes much longer and simulation data amount becomes much larger, which suggests simulation data exhibits the characteristics of data streams. So it is necessary to process simulation data with data stream mining technology. A data stream is an ordered sequence of data items with the characteristics of continuity, high data rates, infinity, and the distribution of data items changing with time. Those facts bring tremondous challenges to datastream mining. Traditional data mining algorithms aiming at static datasets can't be used to mine data streams directly, neither do they have the time and space efficiency. Thus, it is important to research data stream mining algorithms having higher time and space efficiency, and to aim at resolving data mining tasks often used in system simulation.In all data mining tasks, Association rule mining is the one mostly used in simulation. Mining frequent patterns is the key step to generate association rules. So, the research focuses on the key algorithms for mining frequent patterns in data streams. Particularly, three important aspects are researched and implemented, including the algorithms for mining maximal frequent itemsets, closed frequent itemsets, TopK mostfrequent itemsets in data stream, a classification algorithm based on closed frequent itemsets for data stream and a clustering algorithm based on TopK frequent patterns for high dimensional data stream. Lastly, this paper also discusses how to quickly integrate the data stream mining algorithms into various simulation systems, with the emphasizing on the reusability of the data stream mining algorithms in simulation systems. The main innovative achievements of this research can be summarized as follows:(i) An algorithm for mining maximal frequent itemsets in data stream is proposed. The number of maximal frequent itemsets is much less than that of frequent itemsets or closed frequent itemsets. So, mining maximal frequent itemsets can get higher time and space efficiency. Thus, this paper researches on the technology of mining maximal frequent Itemsets in data stream, aiming at presenting an algorithm that can maintain maximal frequent itemsets in the current sliding window over data streams at any time. The contributions of the research lie in the following three aspects. First, the paper presents a novel pruning technique for mining maximal frequent itemsets in data steam, namely Subset Equivalence Pruning. Second, the paper proposes an onepass algorithm for mining maximal frequent itemsets, called FPMFIDS, in which Subset Equivalence Pruning is used to reduce the size of searching space, thereby to improve the efficiency of the algorithm. Lastly, based on FPMFIDS algorithm, FPMFIDS+ algorithm is proposed which can mine maximal frequent itemsets in sliding window over data streams in online updating fashion. Experiments show that for some dense datasets, the size of searching space can be reduced by about 40 percent by Subset Equivalence Pruning, FPMFIDS achieves high performance and good scalability, and FPMFIDS+ has high updatingmining speed and good stability.(ii) An algorithm for mining closed frequent itemsets in data stream is proposed. The number of closed frequent itemsets is less than that of frequent itemsets, but is more than that of maximal frequent itemsets. Closed frequent itemsets also contain the support of all frequent itemsets. So, mining closed frequent itemsets in data stream is efficient which ensures the perfectibility of information. The paper presents an algorithm for mining closed frequent itemsets, called FPMFIDS, which can efficiently mine closed frequent itemsets over a stream sliding window with limited memory space, and maintain exact closed frequent itemsets in current window at any time. The experimental results show that the algorithm FPCFIDS exhibits more tremendous potential than that of the stateoftheart algorithm Moment in terms of time and space efficiency.(iii) An algorithm for mining TopK mostfrequent itemsets in data stream is proposed. One of the advantages of mining TopK mostfrequent itemsets is that users don't need to specify a minimum support, instead of specifying an integer, k, which is the number of itemsets required. The existing algorithms for mining TopK mostfrequent itemsets have the problems of massive initial items and higher initial border support. To solve these problems, the paper presents an efficient mixedsearchingbased algorithm for mining TopK mostfrequent itemsets, MTKFP. The MTKFP algorithm adopts both breadthfirst searching and depthfirst searching to mine TopK mostfrequent itemsets. Moreover, based on MTKFP algorithm, the paper presents a Chernoffbased algorithm for mining TopK mostfrequent itemsets in data stream, MTKFPDS. The experimental results show that the number of initial items of MTKFP algorithm is 70 percent lower than that of the existing algorithms, but the initial border support of MTKFP algorithm is higher than that of the existing algorithms, consequently the performance of MTKFP algorithm is superior to that of the best existing algorithm by over one time; results also show that MTKFPDS algorithm is suitable for mining data streams.(iv) A closedfrequentitemsets based classification algorithm for classifying data stream is proposed. In contrast to some traditional classification algorithms, the algorithms based on association rules have higher classification precision. These algorithms generally generate classification association rules by frequent itemsets. As mining frequent itemsets often suffers from the problem of combination explosion, the efficiency of algorithm is low. Moreover, the emergence of data streams has posed new challenges to those classification algorithms. To solve these problems, the paper proposes an efficient closedfrequentitemsets based classification algorithm, CBCDS, for classifying data stream. In CBCDS, an efficient onepass algorithm for mining closed frequent itemsets and an effective method for constructing classifier are designed. The experimental results show that the average precision of CBCDS is about 1.09 percent higher than that of CMAR algorithm, but CBCDS is much faster than CMAR.(v) The TopKfrequentpatterns based clustering algorithms for clustering high dimensional data stream are proposed. Clustering high dimensional data is a more difficult problem in the research domain of Clustering. The synthetical method with densitybased clustering and gridbased clustering can be used to solve the problem effectively. Traditional method adopts the procedure of mining frequent itemsets to identify dense units. It has two deficiencies. First, it requires user to specify a minimum density threshold. Second, it identifies dense units with the same minimum density threshold for all subspace, so that the units in sparse subspace can't be identified as dense units. To solve these problems, the paper presents three clustering algorithms for clustering high dimensional data stream. These algorithms base on TopK mostfrequent itemsets, Nmost interesting itemsets and TopK frequent items, respectively, and don't require user to specify a minimum density threshold. The second algorithm is in favor of identifying the dense units in subspace group with specific dimension. The third algorithm is in favor of identifying the dense units in specific subspace. So, the problem of identifying the dense units in sparse subspace is solved. The experimental results show that these algorithms are suitable for clustering high dimensional data stream.(vi) The application of data stream mining technology in simulation system is researched. A simulation application framework based on data stream mining technology is proposed. Moreover, to simplify the process of integrating the data stream mining algorithms into the HLAarchitecturebased simulation system, the paper designs universal data stream mining component and general data stream mining federate based on the idea of modularized development so as to improve the reusability of the algorithms. Lastly, the paper introduces the general federate for mining association rules with the example of MissileBreakthrough simulation system.
 Keywords/Search Tags:  simulation, data stream mining, association rule, frequent pattern, maximal frequent itemsets, closed frequent itemsets, TopK mostfrequent itemsets, classification based on association rules, clustering high dimensional data stream  PDF Full Text Request  Related items 
 
