Font Size: a A A

Study On Clustering Algorithms Over Data Stream

Posted on:2011-07-19Degree:DoctorType:Dissertation
Country:ChinaCandidate:R X WanFull Text:PDF
GTID:1118360302480221Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
With the rapid development and the wide application of information technology, a growing number of data appear in people's lives with the form of "continuous data streams", such as computer-network flow, financial services, environmental monitoring and log records analysis, etc. These data, referred as data stream in academia, have some distinguished characteristics including rapid arriving rate, being unrepeatable read, unbounded volume, etc. These characteristics require that the processing algorithms for data streams should have low temporal complexity and low spatial complexity. The traditional database-based processing technology has failed to accomplish the processing requirement of data stream. Recently, analyzing and processing such kind of data has been becoming a hot topic in the academic community.As an important data analysis and processing technique, clustering has been receiving considerable attention among database community. AS various applications generate data in the form of stream, clustering data stream has been attracted more and more attention of researchers. However, the characteristics of data stream prevent most of traditional clustering algorithms to work well in data stream setting. Thus, people have to study some appropriate clustering methods corresponded to data stream.The academic community has done amounts of research work on clustering data stream, but there still exist many problems that should be further studied and resolved. In this thesis, we study the clustering algorithm of data stream from different aspects. The major work of this thesis can be summarized as follows:(1) Explore the issue of clustering data stream with arbitrary shapes. Discovering arbitrary shaped clusters in data stream has a great significance in many applications. However, most of the existing data stream clustering algorithms is limited by their capability of discovering the spherical clusters. Though there are several algorithms (e.g. DenStream) that are reported to have the capability of clustering arbitrary shaped data stream, they have some deficiencies such as the requirement of inputting too many system parameters and so on. In this thesis, we present a grid-based clustering algorithm (GridStream). GridStream stores statistical information by grid, and generates clusters by the iterative merging strategy. The experimental results show that the proposed algorithm can efficiently discover the arbitrary shaped clusters in data stream, and the algorithm has a strong immunity to noise. Compared to other similar algorithms, GridStream can provide the user a more "friendly" operation.(2) Explore the issue of clustering data stream with mixed attributes.Extensive applications require the clustering approaches should have the capability of handling different typical data, but currently, most of the data stream clustering algorithms is designed for a single type of data, they just simply discard other typical data, as a result, the quality of clustering is discounted. In order to solve the mixed-attribute data stream clustering problem, we present a measure for mixed attribute data streams, and based on this measure, we propose two clustering algorithms for mixed-attribute data stream as dCluStream and MStream. dCluStream maintains the statistical information of micro-clusters by the equivalent dissimilarity matrix. Through different levels of dissimilarity, dCluStream will get different clustering effects. MStream divides the data space into some segments, and then maps each data object to a corresponding space segment, the statistical information of micro-clusters are stored in the cluster list. Both of these algorithms can not only be able to handle the data stream with mixed attributes, compared with similar algorithms (e.g. HCluStream), they also have less computation time.(3) Explore the issue of fuzzy clustering data stream.The traditional fuzzy clustering methods should scan the global data set multiple times, and are sensitive to noise and the selection of the initial cluster center, thus, fuzzy clustering on data stream is facing enormous difficulties and challenges. In this thesis, we present a divide-and-conquer solution (SWFCM), that is, firstly, we divide data stream into data chunks according to the order of data arrival time, and then incrementally weight and cluster the arriving data. Since the processed data in each iteration are limited in finite data segments, thus SWFCM can overcome the defect of traditional fuzzy clustering that it should repeatedly scan the global data set. Theoretical analysis and experimental results show that, in the data stream environment, SWFCM has better clustering results, smaller memory and less time consumption than the traditional fuzzy clustering algorithm has.(4) Explore the issue of possibilistic clustering data stream.In order to overcome the sensitivity of noise and the selection of initial cluster centers of fuzzy clustering methods, in this dissertation, we present a soft clustering algorithm (SWPCM) for data stream based on the possibilistic theory. SWPCM uses a similar strategy as SWFCM dose, that is, it divides the data stream into data chunks and then incrementally clusters the weighted data on each chunk according to the order of its arrival time. SWPCM not only inherits the capability of noise immunity of PCM, but also randomly selects the initial centers, which indicates that SWPCM is not sensitive to the choice of the initial cluster centers. Experiments also show that SWPCM has a strong capability of maintaining the purity and integrity of the natural clusters. Compared to the global scanning methods, SWPCM can save an enormous amount of computing time and memory consumption. In this thesis, we make the effective exploration and attempt to put forward some innovative solutions of several key issues on clustering data stream. Theoretical analysis and experimental results show that our proposed algorithms can solve the corresponding issues effectively and efficiently, and are great complementarity and improvement to the existing methods.
Keywords/Search Tags:data stream, grid, dissimilarity matrix, space mapping, fuzzy, possibilistic
PDF Full Text Request
Related items