Research On Data Stream Clustering Algorithm Based On Sliding Windows And Subspace Partition

Posted on: 2011-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2178360302994605
Subject: Computer application technology
Abstract/Summary:
With the rapid development of communication and network technology, a new data model, the data stream, has emerged. It has many real applications, such as Web click streams, telecommunications, weather prediction, and stock exchange information. The main differences between a traditional database and a data stream are: (1) a data stream is unbounded and cannot be stored completely; (2) data are transmitted rapidly and change over time; (3) data arrive continuously and in order; (4) each object can be read only once or a few times. Because of these characteristics, cluster analysis of data streams has become an active research problem in data mining, and many clustering methods have been proposed recently with some success. This thesis studies clustering algorithms over data streams.

First, we study clustering over dynamic sliding windows. To handle data streams that arrive at varying speed, we propose an efficient data stream clustering algorithm over dynamic sliding windows, based on a two-phase framework. In the online component, a novel micro-cluster feature is introduced to store the important statistical information of the data stream. By computing the distance from each data point to the center of each micro-cluster and adjusting the size of the sliding window, the corresponding clustering features are maintained dynamically. In the offline component, we apply the k-means algorithm to the mean values of the micro-clusters from the online component to generate the final clustering results. Experimental results show that our approach achieves higher clustering purity and better scalability.

Second, we study clustering of high-dimensional data streams based on subspace partition. We propose a fast subspace-partition data stream clustering algorithm, which also adopts a two-phase clustering framework. In the online component, we present the extension of adjacent unit (E-unit), a unit that shares an edge or vertex with a dense unit. Moreover, an improved CD-Tree lattice structure is introduced to store the information of non-empty units, maintain the positional relationships among units, and keep the affiliation between dense units (D-units) and E-units. Outdated units are faded by a decay function, so that the corresponding micro-clusters are maintained dynamically. In the offline component, the final clusters are generated from all the micro-clusters by searching for D-units within a radius.

Lastly, we implement the two algorithms in Java. All experiments are performed on the real-life dataset KDD-CUP-99 and a synthetic dataset. The experimental results show the feasibility and effectiveness of our algorithms.
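
The online/offline split of the first algorithm can be illustrated with a minimal Python sketch. It is not the thesis's implementation (which is in Java), and several details are assumptions made for illustration: the micro-cluster feature is reduced to a count, a per-dimension linear sum, and a last-update time; the `radius` threshold and the names `MicroCluster`, `online_update`, and `offline_kmeans` are hypothetical; and the window length is fixed, because the abstract does not state the rule used to adjust it.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Summary of nearby points: count, per-dimension sum, last update time."""
    n: int
    linear_sum: list
    last_seen: int

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point, t):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.last_seen = t


def online_update(clusters, point, t, radius, window):
    """Online step: expire stale micro-clusters, then absorb or create."""
    # Assumption: a micro-cluster leaves the window once it has gone
    # `window` time steps without an update.
    clusters[:] = [c for c in clusters if t - c.last_seen < window]
    nearest = min(clusters, key=lambda c: math.dist(c.center(), point), default=None)
    if nearest is not None and math.dist(nearest.center(), point) <= radius:
        nearest.absorb(point, t)
    else:
        clusters.append(MicroCluster(1, list(point), t))


def offline_kmeans(centers, k, iterations=20):
    """Offline step: plain Lloyd's k-means over micro-cluster centers."""
    centroids = [list(c) for c in centers[:k]]  # deterministic init: first k centers
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for c in centers:
            idx = min(range(len(centroids)), key=lambda i: math.dist(centroids[i], c))
            groups[idx].append(c)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid when a group is empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids
```

A caller would feed each arriving point to `online_update` with its arrival time, and run `offline_kmeans` on `[c.center() for c in clusters]` whenever a clustering result is requested. Only the compact summaries are kept, never the raw points, which is what makes the approach fit an unbounded stream.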
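
The fading of outdated units and the D-unit/E-unit relationship in the second algorithm can be sketched in a similar way. This is a sketch under stated assumptions, not the thesis's method: a plain dictionary stands in for the improved CD-Tree, the decay function is assumed to be exponential (density multiplied by 2^(-rate × age)), E-units are taken to be non-empty, non-dense units adjacent to a dense unit, and the class name `DecayingGrid` and its parameters are hypothetical.

```python
import itertools
import math


class DecayingGrid:
    """Maps grid units to (density, last update time); a dict replaces the CD-Tree."""

    def __init__(self, cell_size, decay_rate, dense_threshold):
        self.cell_size = cell_size
        self.decay_rate = decay_rate
        self.dense_threshold = dense_threshold
        self.units = {}

    def _cell(self, point):
        # Index of the unit that contains the point, one integer per dimension.
        return tuple(math.floor(x / self.cell_size) for x in point)

    def density(self, cell, t):
        """Density of a unit at time t, faded by 2 ** (-decay_rate * age)."""
        d, last = self.units.get(cell, (0.0, t))
        return d * 2.0 ** (-self.decay_rate * (t - last))

    def insert(self, point, t):
        cell = self._cell(point)
        self.units[cell] = (self.density(cell, t) + 1.0, t)

    def dense_units(self, t):
        return {c for c in self.units if self.density(c, t) >= self.dense_threshold}

    def e_units(self, t):
        """Non-empty, non-dense units sharing an edge or vertex with a dense unit."""
        dense = self.dense_units(t)
        found = set()
        for cell in dense:
            # Offsets in {-1, 0, 1} per dimension reach every touching unit.
            for offset in itertools.product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor != cell and neighbor in self.units and neighbor not in dense:
                    found.add(neighbor)
        return found
```

Densities are faded lazily, only when a unit is read or updated, so no pass over all units is needed at each time step. Note that the neighbor enumeration grows as 3^d with the dimension d, which is one reason a subspace partition matters for high-dimensional streams.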
Keywords/Search Tags:Data mining, Data stream, Clustering, Sliding window, CD-Tree grid