Font Size: a A A

Research On Data Stream Clustering Algorithms

Posted on:2007-12-25Degree:MasterType:Thesis
Country:ChinaCandidate:D B ZhouFull Text:PDF
GTID:2178360182496036Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the coming of the age of information, commerce, telecom, medicaltreatment and other industries are being confronted with a large amount offast-arriving data more and more frequently. The importance of streamingdata analysis gradually stands out in the domain of data mining, andclustering on data streams has become a heated topic.A data stream is an ordered sequence of high-speed points which arecoming continuously. Usually a data stream can be regarded as a dynamicdata set, the scale of which increases infinitely with the lapse of time. As aresult, it is too expensive to access a point in stream randomly, and thusrequiring a single scan over the stream has become an object of the clusteringalgorithms. Clustering algorithms partition a data set into several disjointgroups such that points in the same group are similar to each other accordingto some similarity metric. Clustering on streaming data brings forwardchallenges for traditional algorithms in the following aspects: obtaining highquality clusters by only one-pass over the data;time-window analysis over anarbitrary period of the stream etc. As for stream clustering, a common methodis dividing the streaming data into chunks, and algorithms for static sets canbe used on each sub-set separately. In recent years, stream algorithms havedeveloped into a two-phase structure. Usually, a dual framework includes twoparts: the on-line component and the off-line component. The former isresponsible for the fast but rough processing of streaming data and saving thesummary information to meet the one-pass restriction while the latter takesadvantage of the information to conduct high-level analysis. At present,stream algorithms are still facing some problems, for example: sensitive tothe initial data points;bad quality of clusters due to the loss of globalinformation caused by dividing the stream;high time complexity etc.This paper has made a profound reach on the clustering algorithms ondata stream, and put forward some methods based on the dual-tier frameworkto solve the problems above, including:1) The expression of the data is an important problem in the on-linealgorithm, it will influence the method with which the algorithmhandle the data and the efficiency. Landmark model, sliding windowmodel and the snapshot model are all expressions based on the datacompressing, they mapping the data into a much smaller spaceaccording to the value of the data. The micro-cluster can obtain moreinformation about the data by recording the distribution of the points,so as to get a lower requirement of the storage. This paper prompts anew type of the micro-cluster, that is, the grid-cluster, which storesthe points in it. Therefore, the points can be moved in the afterprocess to reflect the change of the data distribution better. This paperprompts another data structure called N-Dim Sphere, which isdesigned for the stream clustering based on the density. The N-DimSphere can describe the distribution of the data from the on-line tierwith a smaller error by considering the radius of the micro-clusters.This make it has the ability to detect the irregular dense area in thedata space.2) The on-line algorithm outputs the mid-data, which the off-linealgorithm gets as the input. The complete-partition strategy designateeach point into a micro-cluster only according to the temporarydistribution of the current data, but this may not be correct. Thispaper prompts an incomplete-partition strategy which only outputsthe points in areas with high density. This is based on the lemma thatthe dense areas in the local space always correspond to dense areas inthe global space. And the other points remaining in the memory arewaiting for the future treatment together with later points. Theincomplete-partition strategy can improve the quality of the mid-datafrom the on-line tier, and thus, the algorithm can get better clusteringresults.3) A novel dual-tier clustering algorithm for data streams, AGCluStream,is proposed in this paper. It will avoid accumulating points forinitialization, and is insensitive to the initial data points.AGCluStream does not divide the stream into chunks and maintainthe global information more effectively, thus, the result clusters canreflect the distribution in real data space better. The agent operationsare defined on the grid-clusters, points are moved to grid-clusterswhere there are more similar points ceaselessly to reflect thedistribution timely. This makes the AGCluStream have a strongeradaptability to the dynamic characteristic of the stream. Theempirical evidence shows that this method can obtain high-qualityclusters with low time complexity.4) At present, most of the clustering algorithms base on density onlyoperate on the abstract points outputted by the on-line tier, but nottake the distribution information in these points into account. Thispaper puts forward a density clustering algorithm bases on theN-Dim Sphere data structure. The algorithm finds the areas with highoverlap-dense, then expands the areas to find clusters in arbitraryshape. Experiment evaluations show the excellent performance of ouralgorithm in the aspects of effectiveness and efficiency.Otherwise, this paper introduces the function, usage and developmentdetails of the Clustering Tools Software (CTS), including: Data Generator,Distribution Viewer and Clustering Algorithm Modules. The software can beused for data cleaning, comparison between algorithms, etc. And any moduleof the software can be embedded into new algorithms to shorten thedevelopment period. CTS adopts an open framework, and publicizes all ofthe source code, thus it is easy to be maintained and upgraded.
Keywords/Search Tags:Clustering
PDF Full Text Request
Related items