Font Size: a A A

Research On Clustering And Pattern Mining Techniques For Uncertain Data Streams

Posted on:2015-01-04Degree:DoctorType:Dissertation
Country:ChinaCandidate:J Y ChenFull Text:PDF
GTID:1108330464468903Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In many data stream applications such as network traffic management, financial data analysis, website log management, video streaming copyright protection, and etc., due to equipment accuracy, noise, interference and privacy protection, uncertainty exists widely in these data streams, which brings great challenges in management and mining data streams. Analyzing the uncertainty of data stream makes it possible to reduce the impact of uncertainty on the mining results, and thus can enhance the quality of the data stream mining.In uncertain data streams mining, the analysis of the uncertainty of data is particularly important for mining quality control. In many applications which contains massive objects, such as traffic monitoring, financial data analysis, website monitoring, etc., the uncertainty of data objects has significant impacts on the similarities among objects and the quality of objects clustering. In the online clustering of data streams such as environmental and meteorological monitoring, the impact of the uncertainty of turple on the quality of micro-clusters should be considered. For mining frequent patterns from uncertain data stream, it is necessary to reflect the frequency distribution of frequent item sets from the perspective of the probability distributions of item sets. In the uncertain sequence pattern mining, it is necessary to measure the frequencies of the probabilistic sequential patterns based on probabilistic models, which requires to expend the existing methods of sequential pattern mining and improve the efficiency of mining of the probabilistic sequence partterns.This dissertation aims to improve the quality of clustering and pattern mining for uncertain data streams, by analyzing the probabilistic feature of uncertain data. The relevant work is supported by the Fundamental Research Funds for the Central Universities and the National Science Foundation of China. The main contents of this dissertation focus on: clustering uncertain objects, clustering uncertain data stream, mining the probabilistic frequent patterns and mining the probabilistic frequent sequential pattern mining. The main content of this dissertation is summarized as follows.The first part focuses on the uncertain objects clustering algorithm based on the synopsis data structures. Firstly, Due to the probability distribution of uncertain objects are not considered in existing clustering, the probability distribution of uncertain objects are modeled in both discrete and continuous domains. Secondly, in order to improve the efficiency of the probability distribution extracting, the synopsis data structures are constructed for the compression of the massive objects data. Then, with the synopsis data structure, the Kullback-Leibler divergence is used to calculate similarity of probability distribution among uncertain objects, and an improved fast Gauss transform technique is adopted to speed up the computational efficiency of similarity. On this basis, extending the existing partitioning based clustering algorithm with the improved KL-divergence as the similarity measure, clustering algorithm KM-KL is proposed based on the probability distribution similarity. Finally, the simulation results show that the improvement of clustering quality and efficiency.The second part focuses on the uncertain data stream clustering algorithm based on quality metrics. Firstly, for the existing uncertain data stream online clustering, a quality metrics model of the micro-clusters is given based on the probability distribution, and histogram models for describing the quality of micro-clusters are constructed. On this basis, a online maintenance strategy based on quality and temporal intervals is proposed to maintain the micro clusters, which divides the buffers according the quality and intervals, and adjusts the buffers according the quality feature of micro-clusters, in order to control the quality and growth of micro-clusters. Then, based on the maintenance strategy, a quality metrics based uncertain data stream clustering algorithm is proposed. Meanwhile, for the high dimensional uncertain data stream, high-dimensional entire space is projected onto the micro-cluster correlation subspaces based on the quality metrics and projection mapping. On this basis, the similarity of micro-clusters is gived in the correlation subspace, and a clustering algorithm of high-dimensional space for uncertain data streams is proposed. Finally, simulation experiments show the low-dimensional and high-dimensional clustering algorithms have higher accuracy and efficiency than the existing algorithms.The third part focuses on the frequent pattern mining algorithms of uncertain data stream based on Sketch model. Firstly, describe the probabilistic feature of frequent patterns based on the possible world model, and optimize the probability of frequent pattern mining by combing the suffix support and the Sketch model. Then, the probabilistic frequent pattern mining is divided into two parts: support mining for frequent pattern and statistics of the probability distribution for frequent items. Based on the suffix support, the construction of frequent pattern tree is optimized, and a suffix support based frequent pattern mining is proposed. Based Sketch and sliding window, the probability distribution of item sets is counted and a probabilistic frequent pattern mining policy is proposed. Meanwhile, based on the probability distribution, prediction model of frequent patterns is designed, and a pruning algorithm based on the prediction model is proposed. Finally, the experimental results show that the suffix support and predictive model mining algorithms can improve the efficiency and accuracy of uncertain data stream frequent pattern mining.The fourth part focuses on uncertain sequential pattern mining algorithm based on pattern-growth framework. Firstly, by the analysis of the sequential uncertain pattern, the measurement of probabilistic sequential pattern is described. Then, through the analysis of the redundancy of the storage structure tree of the existing sequence pattern mining algorithm, by merging the same suffix, based on the directed acyclic graph, a DAG storage structure PG-DAG for the probabilistic sequence is given. In order to enhance the support of PG-DAG for probabilistic sequence partterns, using the weight of the edge of PG-DAG to represent the support values, a weighted PG-DAG(W-PG-DAG) is proposed. Finally, based on probability model the pruning strategy is designed and a mining algorithm for uncertain frequent sequential patterns is proposed. Finally, experiments show that the algorithm can effectively improve the mining efficiency of probability frequent patterns of uncertain sequences, and raise the storage efficiency.
Keywords/Search Tags:uncertain data, synopsis data structure, micro-cluster, suffix support, probabilistic sequential pattern
PDF Full Text Request
Related items