Font Size: a A A

Research On Algorithms For Mining Top-K Frequent Patterns Over Data Streams

Posted on:2010-03-22Degree:DoctorType:Dissertation
Country:ChinaCandidate:B YangFull Text:PDF
GTID:1118360278952583Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data stream is a new data model coming up in recent years. It has been called for in many applications including web click stream, traffic monitoring and management, electrical power management and forecasting, sensor network data analysis, telecommunication management, financial application, business trading and others. Data stream model is different from traditional database. It has the characteristics of rapidness, real-time, continuousness and boundlessness. Therefore, the algorithms for mining data streams have substantial difference from that of traditional database mining. They are one-pass algorithms. It is impossible to maintain all the elements of data streams for the limit of capacity of main memory. Designing a synopsis data structure with far small size compared with that of the data stream to save the summary features of passed data and providing information for queries and analysis over data stream is a good idea. Thus, data stream processing sacrifices the precision of its mining results by allowing some errors. Considering the speed and continuousness of streaming data, the algorithms for mining data streams have to be incremental and efficient in time and space. The existing database management techniques can hardly be applied to process data streams effectively. Mining over dynamic data streams brings unique opportunities but also great challenges.We explore some principle problems of mining data steams in this dissertation and the contributions are as the following:1. Mining top-K frequent patterns in data stream landmark windows dynamically and incrementally. Mining top-K frequent patterns has significant applications in real world. A dynamical incremental approximate algorithm TOPSIL-Miner is presented to mine top-K frequent patterns in landmark windows. A new data structure, TOPSIL-Tree, is designed to store the potential significant itemsets and other data structures of MaxSL(Maximum Support List), OIL(Ordered Item List), TOPSET and MinSL (Minimum Support List) are devised to maintain information about mining results. Moreover, several optimal strategies are exploited to reduce time and space cost of the algorithm. The upper bound of errors of the algorithm is also analyzed theoretically.2. Mining top-K frequent N patterns in data stream sliding windows efficiently. Sliding window model has high application value since it gives more concerns of recent streaming data than old ones. During the mining of sliding windows, we should process not only the new data arriving recently but also the old expired data. Therefore data mining in sliding windows is much difficult than landmark windows. We propose algorithm TOPSIS to mine the top-K frequent N-patterns over data stream sliding windows. A sliding window is divided into several sub windows that are called basic windows serving as updating units. Two new data structures TOPSIS-Tree and RELIS are devised to maintain potential significant itemsets and mined results, respectively. Moreover, three optimal strategies are exploited to reduce time and space consumption of the algorithms: (1) pruning impotent nodes in the newest basic window whenever a sliding window updates, (2) promoting support threshold during mining process dynamically, and (3) adjusting pruning threshold adaptively and heuristically. The error upper bound of the algorithm is also analyzed theoretically.3. Mining top-K frequent closed patterns in data stream sliding windows effectively. Frequent closed pattern is concise expression of frequent pattern. The set of frequent closed patterns determines exactly the complete set of all frequent patterns with accurate support counts and is usually orders of magnitude smaller than that of the frequent patterns. We propose an efficient approximate algorithm for mining top-K frequent closed patterns in data stream sliding windows. We also design a new compressed prefix extension tree—TCIS-Tree which stores not only the summary information of the current sliding window but also the candidates of closed patterns. During the updating and mining of TCIS-Tree several optimal strategies, such as data filtering, heuristically and dynamically adjusting the pruning threshold and mining threshold, and so on, have been explored to make our method efficient. The checking of closed pattern is performed by a two level hash map based on TCIS-Tree, which leads to the effective mining of top-K frequent closed patterns.4. Quantile computation over data streams. Quantile is an important measurement of statistic of datasets. We propose a novel synopsis data structure—Nord_Histogram for storing streaming data summary and a one-pass online approximate algorithm, NORMAL, for quantile computation. The algorithm implements the approximate quantile queries over data stream with the time and space requirements being linear with the number of the buckets, which has no business with the length of data streams, and therefore has great scalability. This method has very good performance on data with uniform distribution. The correlation between the computation accuracy and main memory requirement is also analyzed.Extensive experiments have been conducted to evaluate the time consumption, memory usage and the accuracy of the algorithms mentioned above. Some comparison experiments to other relevant algorithms have also been performed. Experiment results show the good effectiveness and the high efficiency and accuracy of the algorithms.
Keywords/Search Tags:Data stream, top-K frequent pattern, sliding window, landmark window, histogram, quantile
PDF Full Text Request
Related items