Font Size: a A A

Research And Apply On Algorithms For Clustering Data Stream

Posted on:2009-02-14Degree:MasterType:Thesis
Country:ChinaCandidate:P X TangFull Text:PDF
GTID:2178360242494604Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Data Mining, also referred to as Knowledge Discovery from Database, is to abstract the potential, unknown and useful information or pattern from the large database or data warehouse. It integrates the theories and technologies of database, artificial intelligence, machine learning and statistics etc. And it is a very promising study field in the research of database with the high valuable application. Clustering Analysis is one of the most important technologies in data mining. Clustering algorithms partition a data set into several disjoint groups such that points in the same group are similar to each other according to some similarity metric.In recent years, because of computer and apply a technical high speed development, people obtain the ability of data to get a tremendous exaltation, Data Streams is a type of important data source, and is subjected to more and more concern, management system and its algorithm based on data streams model have become an important applied topic.A data stream is an ordered sequence of high-speed points which are coming continuously. Usually a data stream can be regarded as a dynamic data set, the scale of which increases infinitely with the lapse of time. As a result, it is too expensive to access a point in stream randomly, and thus requiring a single scan over the stream has become an object of the clustering algorithms. Clustering on streaming data brings forward challenges for traditional algorithms in the following aspects: obtaining high quality clusters by only one-pass over the data; time-window analysis over an arbitrary period of the stream etc. In recent years, stream algorithms have developed into a two-phase structure. Usually, a dual framework includes two parts: the on-line component and the off-line component. The former is responsible for the fast but rough processing of streaming data and saving the summary information to meet the one-pass restriction while the latter takes advantage of the information to conduct high-level analysis. At present, stream algorithms are still facing some problems, for example: bad quality of clusters due to the loss of global information caused by dividing the stream; high time complexity etc.This paper has made a profound reach on the clustering algorithms on data stream, and put forward some methods based on the dual-tier framework to solve the problems above, including: 1) The expression of the data is an important problem in the on-line algorithm, it will influence the method with which the algorithm handle the data and the efficiency. Landmark model, sliding window model and the snapshot model are all expressions based on the data compressing, they mapping the data into a much smaller space according to the value of the data. The micro-cluster can obtain more information about the data by recording the distribution of the points,so as to get a lower requirement of the storage. This paper prompts a type of the micro-cluster, which stores the points in it. Therefore, the points can be moved in the after process to reflect the change of the data distribution better.2)The on-line algorithm outputs the mid-data, which the off-line algorithm gets as the input. This paper prompts an mix strategy that begining use complete-partition strategy and afterwards use an incomplete-partition strategy. This is based on the lemma that the dense areas in the local space always correspond to dense areas in the global space. And the other points remaining in the memory are waiting for the future treatment together with later points. The past beginning starts of complete divide the line of cluster which end the density if is high output, but the partition its density is low will to handle with follow-up data together. Ths strategy can improve the quality of the mid-data from the on-line tier, and thus, the algorithm can get better clustering results.3) A improve dual-tier clustering algorithm for data streams, SCluStream, is proposed in this paper. It can reflect the distribution in real data space better. The algorithm maintain the global information more effectively when firstly clustering data streams. This makes the SCluStream have a stronger adaptability to the dynamic characteristic of the stream. The empirical evidence shows that this method can obtain high-quality.4) This paper puts forward a new algorithm, DenCluStream, which can be used to data mining the arbitrary shape cluster for data stream. We led the density function into the data structure as the weight, and use dense micro-cluster to describe the arbitrary-shaped clusters in data streams. In addition, we propose candidate core-micro-cluster and isolated micro-cluster synopses to maintain and distinguish the potential micro clusters and isolation point in data streams. The result on-line outputted is saved as high dimensional ball cluster at out-line, and conserve the save space.In addition, this paper discuss the application about cluster algorithm provisionally, and analyze apply current situation of cluster algorithm presently. It looks forward application prospect for the study in future.
Keywords/Search Tags:stream, clustering, dual-tier framework, density function
PDF Full Text Request
Related items