
Clustering transient data streams by example and by variable

Posted on: 2010-12-08
Degree: Ph.D.
Type: Dissertation
University: University of Maryland, Baltimore County
Candidate: Chaovalit, Pimwadee
Full Text: PDF
GTID: 1448390002477142
Subject: Information Science
Abstract/Summary:
Due to recent advances in data collection techniques, massive amounts of data are being collected at an extremely fast pace, and these data are potentially unbounded. Boundless streams of data collected from sensors, equipment, and other data sources are referred to as "data streams". Various data mining tasks can be performed on data streams in search of interesting patterns. This dissertation studies one particular data mining task, clustering, which can serve as the first step in many knowledge discovery processes. By grouping data streams into homogeneous clusters, data miners can learn about data characteristics, which can then be developed into classification models for new data or predictive models for unknown events. Data stream clustering calls for techniques that require only a single pass over the data and a very short processing time per data point. Moreover, the system will likely have to discard data that have already been viewed, so suitable techniques are needed to update the clustering model incrementally. We propose a novel method called POD-Clus (Probability and Distribution-based Clustering) that meets these requirements. This dissertation covers two paradigms for data stream clustering. In clustering by example, data points collected from the same data source can have different cluster assignments. In clustering by variable, each stream is treated as one unit, and all data points from the same stream must stay in the same cluster. We demonstrate that POD-Clus is applicable to both paradigms. POD-Clus also handles situations in which clusters evolve. Cluster evolution is relevant to data stream clustering because the nature of clusters drawn from boundless streams may change considerably over time. We consider the following types of cluster evolution: cluster appearance, cluster disappearance, cluster splitting, and cluster merging.
The methodologies in this dissertation are grouped into (a) clustering by example without evolution, (b) clustering by example with evolution, (c) clustering by variable without evolution, and (d) clustering by variable with evolution. We conducted experiments on POD-Clus and compared it against recent data stream clustering algorithms. The results show that POD-Clus yields significantly better clustering results than the competing algorithms.
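The single-pass, discard-after-viewing requirement described in the abstract can be illustrated with a minimal sketch. It is not POD-Clus; the abstract does not give that algorithm's details. It is a generic nearest-centroid scheme, chosen only to show how a cluster summary can be updated incrementally (here with Welford's online mean and variance update) without storing past points. The names `RunningCluster`, `cluster_stream`, and the `radius` threshold are hypothetical.

```python
import math


class RunningCluster:
    """Summary statistics for one cluster; raw points are never stored.

    Hypothetical helper for illustration, not part of POD-Clus.
    """

    def __init__(self, point):
        self.n = 1
        self.mean = list(point)
        self.m2 = [0.0] * len(point)  # per-dimension sum of squared deviations

    def update(self, point):
        # Welford's online update: O(d) work per point, one pass over the data.
        self.n += 1
        for i, x in enumerate(point):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        # Sample variance per dimension; zero until a second point arrives.
        return [m / (self.n - 1) if self.n > 1 else 0.0 for m in self.m2]

    def distance(self, point):
        return math.dist(self.mean, point)


def cluster_stream(stream, radius):
    """Assign each point to the nearest cluster within `radius`, else open a new one.

    `radius` is an assumed tuning parameter, not taken from the dissertation.
    """
    clusters = []
    for point in stream:  # each point is seen exactly once, then discarded
        nearest = min(clusters, key=lambda c: c.distance(point), default=None)
        if nearest is not None and nearest.distance(point) <= radius:
            nearest.update(point)
        else:
            clusters.append(RunningCluster(point))
    return clusters
```

In the terms used by the abstract, this sketch corresponds to clustering by example without evolution: individual points from one source may land in different clusters, and clusters are never split, merged, or retired.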
Keywords/Search Tags: Data streams, Clustering, Example, Variable, POD-Clus