Font Size: a A A

A Density-Based Clustering Algorithm Over Stream Data

Posted on:2006-09-17Degree:MasterType:Thesis
Country:ChinaCandidate:F X ZhaoFull Text:PDF
GTID:2168360155471696Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The wealth of information embedded in huge databases of corporations has spurred a tremendous interest in the areas of knowledge discovery and data mining. As a critical analysis technique, clustering is defined as follows: given n data points in a k-dimensional metric space, clustering is to panition the data points into d clusters such that the data points within a cluster are more similar to each other than that in different clusters.Clustering has been studied extensively by many related research fields such as data mining, statistics, machine learning, and has been applied into several applications such as marketing cluster analysis, spatial database technology, biology, gene analysis of biomedicine, land resource utilization etc. In the field of data mining, clustering mainly aims at large datasets, and emphasizes on the scalability and the validity for the dataset of complex shape and various attributes.Recently, there are more and more applications that are facing the envirnoment of stream data. Stream data is a kind of continuous, ordered, changing, fast and huge amount data. It is quite a new object that is different from traditional static data stored on the disk. The typical stream data include data from network monitoring and traffic engineering system, calling data from telecom departments, inspection data from sensors, stock information from stock exchange, and so on. To deal with this new data object, cluster analysis on stream data becomes an important research branch of clustering.Compared with traditional cluster analysis, cluster analysis on stream data meets much more challenges because of the properties of stream data. For stream data, the following conditions should be satisfied: firstly, limited memory and storage space; secondly, access to data at most one time; thirdly, keeping up the speed of data streams. In this thesis, we analyze what are required for clustering on stream data, and give several recent research results. Currently, most of clustering algorithms for stream data are based on the K-Median or K-Mean, and their drawback is that these algorithms can not be applied into the dataset of arbitrary distributions in shape. Traditionaldensity-based clustering methods, such as DBSCAN, could be adaptable to the dataset of arbitrary shape, but have high computational complexity and scan datasets several times. Space partition-based clustering methods, such as CLIQUE, also have the capacity of finding arbitrary shape clusters and scan datasets only once, but lead to the number of generated cells increasing exponentially varied with the number of dimensions.In this thesis we put forward a novel index structure CD-Tree, which indices the non-empty cells in partitioned space. Based on CD-Tree, to find arbitrary shape clusters, a density-based clustering algorithm over stream data named CDS is proposed. This algorithm consists of two parts: online clustering based on sliding window and offline evolution analysis based on data in disk. Moreover, a measure named DSF is proposed to measure the skew of dataset. With DSF, the granularity adjustment can be implemented dynamically within online clustering, so the accuracy of clustering can be maximized within limited memory. To analyze stream data with more time spans, the algorithm stores the summary of stream data in terms of tilted-time window policy, by which offline evolvement analysis can be performed.To evaluate the performance and effectivity of the algorithm proposed in this thesis, we make extensive of experiments on real-life datasets and synthetic datasets. The results show that our algorithm can get higher accuracy of clustering within limited memory, and has the good scalability with the quantity and the dimensionality of stream data.
Keywords/Search Tags:data mining, clustering, data streams, space partition, CD-Tree, clustering stream data
PDF Full Text Request
Related items