Data streams model appears widely in many applications, such as economic, military, logistical, financial, telecom and other applications. Due to equipment accuracy, transmission loss, ambient interference, equipment failure, privacy protection and integration between different systems and other reasons, uncertainty in the data stream is also widespread. Therefore, designing data mining algorithms over uncertain data streams has become a hot research topic. Studies on frequent itemsets mining as an important part of the data stream mining work have gone through more than ten years. However, previous studies are mainly based on deterministic data, they can not be directly used in uncertain data stream due to the introduction of the probability. How to mine frequent itemsets considering the uncertainty becomes an important issue in the area of uncertain data stream mining area.This dissertation summarizes the uncertain phenomena and issues of uncertain data management and analyzes the state of arts about frequent itemset mining algorithm, then proposes several algorithms suitable for uncertain data stream. Extensive experiments have been conducted and the results show the efficiency of the proposed algorithms. Specifically, the major work includes:(1) Based on the sliding window model, the efficient algorithms are proposed to detect frequent items on uncertain data stream. In order to reduce the computation cost, the algorithm incrementally updates the results upon each window slide and thus eliminates the need for complete recomputation from scratch. The pruning rules are proposed, which can reduce the candidate set and improve the query efficiency. The detailed analysis and experimental results demonstrate the efficiency and scalability of the algorithms.(2) Based on the sliding window model, an efficient incremental algorithm is proposed to detect Top-K frequent items on uncertain data stream. According to the current data in the sliding window, items to be frequent in the future are predicated. The pruning rules are proposed, which can reduce the candidate set and improve the query efficiency. In order to reduce the space complexity, instead of maintaining candidate sets independent, a single integrated structure is proposed to represent the candidate sets for all windows.(3) In this dissertation, an exact incremental mining algorithm is proposed for finding probabilistic frequent itemsets from the uncertain streaming data with a sliding window. In the algorithm, a new tree structure called the compressed probabilistic frequent-pattern tree (CPFP-Tree) is designed to efficiently keep the related information in the mining process. In the CPFP-Tree, the same items will be merged in a branch of the tree even with different probabilities. A mining algorithm called CPFP-mine is proposed based on the tree structure to find uncertain frequent patterns. The algorithm incrementally updates the results upon each window slide, and pruning rules are designed to reduce the candidate set and improve the execution time efficiently. The efficiency of the algorithm is demonstrated by experiments.(4) Frequent itemset mining over probabilistic database has attracted much attention recently. However, existing solutions may lead to an exponential number of results due to the downward closure property over probabilistic data. Thus, in order to obtain a reasonable result set with small size, a condensed representation such as closed patten or generator is adopted. Because the generator is preferable to the closed pattern in inductive inference and classification, the problem of threshold-based frequent itemsets generator over probabilistic data is studied in this dissertation. The formal definition of probabilistic itemset generator and its computing method are first proposed. Then, an efficient mining algorithm is developed to obtain all probabilistic frequent itemset generators. Several pruning techniques are further designed to reduce the search space and avoid redundant computation. The efficient algorithm to mine frequent itemset generators over a stream sliding window is also develeped, adopting a novel enumeration tree structure to help keep the information of mined generators and the border between generators and non-generators. Some optimization techniques to speed up the mining process are devised. The effectiveness and efficiency of the proposed methods are verified through extensive experiments. |