Font Size: a A A

Based On Sliding Window And The Grid Density Data Stream Clustering Algorithm Research

Posted on:2013-02-02Degree:MasterType:Thesis
Country:ChinaCandidate:Y J OuFull Text:PDF
GTID:2248330374488295Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In recent decades, the problem of data stream clustering has attracted much attention and has been extensively studied by many international domestic academics. However, there are still many issues unsolved, such as the problem of efficiency of algorithm, storage space and precision of clustering. Based on the research of common clustering algorithms for data streams and the analysis of advantages and disadvantages of the algorithms, this paper proposes optimized data stream clustering algorithms.A majority of data stream is characterized by high-dimensional, it is therefore inefficient to cluster the crude data streams due to its inherently complexity. Besides, the data stream has the characteristic of evolution and people tend to be more concerned about the recent data. This paper proposes an effective clustering algorithm referred as HSWStream for high dimensional data stream over sliding window. First, the algorithm uses the technique of projected clustering to descend dimension, with the efficiency improved. Second, the algorithm eliminates the influence of historical data with the technology of sliding window and maintains the summary information of clusters with exponential histogram of cluster feature. Finally, in order to bring more efficiency, the method of maintaining exponential histogram has been improved. The experimental results demonstrate that HSWStream has higher quality, less memory and better scalability than HPStream.Most of the existing grid-based data stream clustering algorithms are able to deal with the clusters with arbitrary shape, but deficiencies still exist. Actually, the data space is partitioned by many grids. Some grids on cluster edge are often regarded as sporadic ones because they contain less data points. As a result, the accuracy is reduced due to the incomplete information on cluster edge. This paper proposes a data stream clustering algorithm based on grid density and force called F-Stream. Using the CluStream double-layer framework, this algorithm divides the clustering process into online component and offline component. The online component maps new data points to the corresponding grids and updates the characteristic vectors of grids, while the offline component clusters the grids. The algorithm adopts a cluster boundary treatment technique, which is based on grid force, to improve the accuracy of clustering and make the cluster edge smoother. To judge whether the grid on cluster edge is a sporadic grid, the algorithm introduces both the distance and the density of grid as the judgment standard. The computation of grid centroid uses weighted mean instead of arithmetic mean. Consequently, the position of grid centroid can represent the space distribution of data points indirectly. Moreover, the incremental updating of the centroid saves a lot of time and space. The experimental results indicate that the F-Stream has higher clustering quality and speed compared with the CluStream. Due to the implement of boundary treatment technique, it avoids the loss of cluster boundary information and has higher accuracy than D-Stream.
Keywords/Search Tags:data stream, clustering, sliding window, grid density, gridforce
PDF Full Text Request
Related items