Algorithm For Clustering Data Streams Based On Density Units Covered

Posted on: 2008-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: H X Shi
Full Text: PDF
GTID: 2208360215960479
Subject: Computer software and theory
Abstract/Summary:
With the widespread use of information technology, information systems generate ever-growing volumes of data. Using this raw data effectively to analyze the current situation and make predictions has become a major challenge. A growing number of applications generate streams of data, such as network flows, sensor data, and web click streams. These streams are temporally ordered, fast changing, massive, and potentially infinite, and analyzing and mining them has become an active research topic.

Clustering is an important task in mining evolving data streams. As an unsupervised classification method, clustering groups similar multi-dimensional data vectors into a number of clusters. Compared with traditional cluster analysis, cluster analysis on stream data faces more challenges because of the properties of the data. A stream algorithm must satisfy three conditions: first, it must work within limited memory and storage space; second, it may access each data item at most once; third, it must respond quickly.

Besides the limited-memory and one-pass constraints, the evolving nature of data streams imposes further requirements on stream clustering: no assumption about the number of clusters, discovery of clusters of arbitrary shape, and the ability to handle outliers. Although many clustering algorithms for data streams have been proposed, they do not satisfy all of these requirements together. Traditional density-based clustering methods, such as DBSCAN, can adapt to datasets of arbitrary shape, but they have high computational complexity and scan the dataset several times.

This paper proposes a new data stream clustering algorithm under the sliding window model, called DucStream, which is based on density unit cover. DucStream can discover clusters of arbitrary shape, and it addresses the influence of historical data on current clusters, which arises because memory cannot precisely record every data item under the sliding window model. Core density units and candidate density units serve as online data synopses that describe the distribution of the data. Units are pruned according to the most recent data arrival time in each unit, which guarantees that the online synopsis is the smallest cover of the data in the current window. The final clustering output is then produced on request. Experimental results on real and synthetic data sets demonstrate the good performance of the DucStream algorithm.
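
The abstract does not give the details of DucStream, so the following Python sketch is only an illustration of the general idea of density units under a sliding window, not the thesis's algorithm. Every name and parameter here (`DensityUnit`, `DensityUnitSketch`, `radius`, `core_threshold`, `window`) is hypothetical, as are the rules that a point joins the nearest unit within `radius` and that core units whose centers lie within `2 * radius` belong to the same cluster. For simplicity the sketch prunes whole units by their most recent arrival time and does not expire individual points inside a unit.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import dist


@dataclass
class DensityUnit:
    """Hypothetical synopsis of the points near `center`."""
    center: tuple[float, ...]  # fixed at the point that created the unit
    count: int = 1             # number of points absorbed so far
    last_seen: int = 0         # arrival time of the most recent point


class DensityUnitSketch:
    """Illustrative density-unit synopsis over a time-based sliding window."""

    def __init__(self, radius: float, core_threshold: int, window: int):
        self.radius = radius                  # assumed unit radius
        self.core_threshold = core_threshold  # assumed count for a core unit
        self.window = window                  # assumed window length in time steps
        self.units: list[DensityUnit] = []

    def insert(self, point, t: int) -> None:
        # Prune units whose most recent arrival fell out of the window.
        self.units = [u for u in self.units if t - u.last_seen < self.window]
        nearest = min(self.units, key=lambda u: dist(u.center, point), default=None)
        if nearest is not None and dist(nearest.center, point) <= self.radius:
            nearest.count += 1
            nearest.last_seen = t
        else:
            # A new unit starts as a candidate (count below the threshold).
            self.units.append(DensityUnit(tuple(point), 1, t))

    def core_units(self) -> list[DensityUnit]:
        return [u for u in self.units if u.count >= self.core_threshold]

    def clusters(self) -> list[list[tuple[float, ...]]]:
        """Group core units whose centers are within 2 * radius (union-find)."""
        cores = self.core_units()
        parent = list(range(len(cores)))

        def find(i: int) -> int:
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for i in range(len(cores)):
            for j in range(i + 1, len(cores)):
                if dist(cores[i].center, cores[j].center) <= 2 * self.radius:
                    parent[find(i)] = find(j)

        groups: dict[int, list[tuple[float, ...]]] = {}
        for i, unit in enumerate(cores):
            groups.setdefault(find(i), []).append(unit.center)
        return list(groups.values())
```

In this sketch, clustering is computed only when `clusters()` is called, mirroring the abstract's statement that the final output is produced on request. Units that never reach `core_threshold` remain candidates and are left out of the result, which is one simple way to treat isolated points as outliers.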
Keywords/Search Tags: data stream, clustering, sliding window, density unit