Clustering transient data streams by example and by variable

Posted on:2010-12-08

Degree:Ph.D.

Type:Dissertation

University:University of Maryland, Baltimore County

Candidate:Chaovalit, Pimwadee


GTID:1448390002477142

Subject:Information Science

Abstract/Summary:

Recent advances in data collection techniques mean that massive amounts of data are being collected at an extremely fast pace, and these data are potentially unbounded. Boundless streams of data collected from sensors, equipment, and other data sources are referred to as "data streams". Various data mining tasks can be performed on data streams in search of interesting patterns. This dissertation studies one such task, clustering, which can serve as the first step in many knowledge discovery processes. By grouping data streams into homogeneous clusters, data miners can learn about data characteristics, which can then be developed into classification models for new data or predictive models for unknown events.

Data stream clustering calls for techniques that require only a single pass over the data and a very short processing time per data point. Moreover, the system will likely have to discard data that have already been viewed, so suitable techniques are needed to update the clustering model incrementally. We propose a novel method called POD-Clus (Probability and Distribution-based Clustering) that meets these requirements.

This dissertation covers two paradigms for data stream clustering. In clustering by example, data points collected from the same data source can have different cluster assignments. In clustering by variable, each stream is treated as one unit, and all data points from the same stream must stay in the same cluster. We demonstrate that POD-Clus is applicable to both paradigms. POD-Clus also handles situations in which clusters evolve. Cluster evolution is relevant to data stream clustering because the nature of clusters in boundless streams may change considerably over time. We consider the following types of cluster evolution: cluster appearance, cluster disappearance, cluster splitting, and cluster merging.

The methodologies in this dissertation are grouped into (a) clustering by example without evolution, (b) clustering by example with evolution, (c) clustering by variable without evolution, and (d) clustering by variable with evolution. We conducted experiments on POD-Clus and compared it against recent data stream clustering algorithms. The results show significant improvements in clustering results when using POD-Clus compared with the competing algorithms.
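
To make the single-pass, incremental-update requirement concrete, the following is a minimal sketch of clustering by example on a one-dimensional stream. It is not POD-Clus: the abstract does not give the algorithm's details, so this swaps in a generic approach that keeps a running summary (count, mean, spread) per cluster using Welford's update and opens a new cluster when a point is too far from every existing one. The names `ClusterSummary` and `cluster_stream`, the `threshold` value, and the spread `floor` are all assumptions made for illustration.

```python
import math


class ClusterSummary:
    """Running summary of one 1-D cluster (hypothetical helper, not POD-Clus)."""

    def __init__(self, x):
        self.n = 1            # number of points absorbed so far
        self.mean = float(x)  # running mean
        self.m2 = 0.0         # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's online update: the raw point can be discarded afterwards.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self, floor=1.0):
        # Sample standard deviation, floored so young clusters are not too tight.
        if self.n < 2:
            return floor
        return max(math.sqrt(self.m2 / (self.n - 1)), floor)


def cluster_stream(stream, threshold=3.0):
    """Assign each point to the nearest cluster in one pass.

    A point farther than `threshold` standard deviations from every
    cluster mean starts a new cluster (an assumed, illustrative rule).
    """
    clusters = []
    labels = []  # kept only for demonstration; a real stream would not store these
    for x in stream:
        best, best_d = None, float("inf")
        for i, c in enumerate(clusters):
            d = abs(x - c.mean) / c.std()
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > threshold:
            clusters.append(ClusterSummary(x))
            best = len(clusters) - 1
        else:
            clusters[best].update(x)
        labels.append(best)
    return clusters, labels


if __name__ == "__main__":
    clusters, labels = cluster_stream([1.0, 1.2, 0.8, 10.0, 10.3, 9.7, 1.1])
    print(labels)
    print([(c.n, round(c.mean, 3)) for c in clusters])
```

Because each point is labeled individually, points from the same source can land in different clusters, which matches the clustering-by-example paradigm. Clustering by variable would instead require summarizing each whole stream and assigning that summary to a single cluster.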

Keywords/Search Tags:

Data streams, Clustering, Example, Variable, POD-Clus

Related items

1	The Application And Research Of Incremental Clustering On Temporal Data Streams
2	Research And Implementation On Clustering Algorithms In Uncertain Data Streams Environment
3	Research Of Optimized Clustering Algorithms Over Data Streams
4	Study On Key Technologies Of Frequent Items Mining And Clustering On Data Streams
5	The Research And Realization Of Clustering Algorithm In Data Streams Mining
6	Studies On Clustering Algorithms For Categorical Data
7	Research On Real-time Data Streams Clustering Framework
8	Uncertain Clustering Method And Its Application In Data Streams Processing
9	Research On Clustering Algorithm Based On Subspace In High-dimensional Data Streams
10	Algorithms For Data Streams Based On Shielding/Summarizing