Font Size: a A A

Algorithms For Clustering Over Data Streams

Posted on:2007-11-18Degree:DoctorType:Dissertation
Country:ChinaCandidate:F CaoFull Text:PDF
GTID:1118360212984271Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Recently, various applications generate a large number of streaming data, such as network flow, sensor data, and web page clicks. Mining and analyzing such kind of data has been becoming a hot topic. As a basic method of data mining, clustering in data stream settings has been wildly concerned from academia and industry. Different from traditional data bases, data streams have the following distinguished characteristics: (1)unbounded volume of data; (2)rapid arriving rate of data; (3)uncontrollability of tuples' arriving order; (4) being processed only once for each tuple, except being reserved.Above characteristics propose the following requirements on clustering over data streams: Firstly, algorithms should be online mining, be able to fast process each tuple, and output the mining result on time; Secondly, because of the limited memory compared to unlimited data streams, algorithms should have low space complexity, the space complexity should be in the range of log bound of data volume. Thirdly, by the constraint of online mining and low space complexity, algorithms can only get approximate results, however, that results should be guaranteed on precision. At last, algorithms should be adaptive, including the applicability on data streams with evolving underlying model, the abilities on handling outliers and arbitrary-shaped clusters.A lot of methods on stream clustering have been proposed, however there are many problems need to be researched and resolved. In this paper, we study the the problem of clustering data streams over sliding windows, the problem of discovery arbitrary-shaped clusters in data streams, the problem of clustering data streams via graphic processors and the clustering problem in distributed data stream settings, in order to provide the existing data stream systems more functions on clustering analysis. The main contributions of this thesis include the following aspects:1. We propose a novel method, CluWin, to cluster data streams over sliding windows, and a new synopsis named Exponential Histogram of Cluster Featuresto maintain the statistic information of clusters in sliding windows. CluWin algorithm can provide the clustering results in sliding windows ( window size = N) with relative error no more than e by maintaining O(k/ε log(ε[N/K]) Time Clustering Feature synopses. In addition, it has been extended to deal with clustering problems over the n-of-N sliding window (an extended model of sliding window).2. We propose a new approach, DenStream, to discover clusters with arbitrary shapes in evolving data streams, and introduce a dense micro-cluster(called core-micro-cluster) to describe the arbitrary-shaped clusters in data streams. In addition, we propose potential core-micro-cluster and outlier micro-cluster synopses to maintain and distinguish the potential clusters and outliers in data streams. Based on these concepts, DenStream includes a novel pruning strategy, which can maintain and ensure the precision of the weight of micro-clusters by memory space with log complexity.3. We utilize the powerful, cheaper and cheaper and not enough considered graphic processing units (GPUs) to cluster data streams. We propose a group of fast clustering methods via GPUs, including the basic k-means method, the stream clustering algorithm, and the evolving analysis method over data streams. The common characteristics of these methods are making use of the strong computational and pipeline power of GPUs. Different from pervious clustering methods with individual framework, our methods share the same framework with multi-function, which provides a uniform platform for stream clustering.4. We propose a new framework, CluDistream, for clustering distributed data streams, which can effectively handle a huge volume of data in real-time, the noisy, corrupted or incomplete data records, the overlapping properties of data set in distributed data streams. In CluDistream, the EM-based (Expectation Maximization) algorithms, each data record is assigned to a cluster with certain degree of membership. This soft clustering reflects more naturally the overlapping property of the clusters. In the presence of noisy, corrupted or incomplete data records, our algorithms learn the parameters governing the distribution of underlying data streams by maximizing the likelihood of the data clusters. Moreover, a test-and-cluster strategy is proposed to reduce the average processing cost, which is especially effective for online clustering over large data streams.All in all, we study four principal problems of clustering data streams and propose novel solutions for them in this thesis. Since sliding window is a basic model for processing data streams, how to cluster data streams over sliding windows becomes a basic problem; arbitrary-shaped clusters model is more general than spherical clusters model, therefore discovery arbitrary-shaped clusters in data streams is a basic problem; the characteristic of online mining makes how to improve the processing speed of algorithm a basic issue in clustering data streams; clustering distributed data streams is a basic problem, because in real life applications, the data streams are often generated in a distributed environment. Our methods are great complementarity and improvement to existing stream clustering methods. Theoretic analysis and experimental results show that our methods can solve their corresponding problems efficiently and outperform previous clustering methods in space complexity, processing rate and result quality.
Keywords/Search Tags:data stream, clustering, evolving, window, graphic processing unit
PDF Full Text Request
Related items