Research On Data Stream Clustering Algorithm Based On Density Grid

Posted on: 2012-11-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y Mi
Full Text: PDF
GTID: 2218330338467517
Subject: Computer system architecture
Abstract/Summary:
Data mining means extracting, or "mining", knowledge from large amounts of data. More specifically, it uncovers the essential characteristics and general laws behind the data by analyzing the data. Clustering is an important data mining method that has been widely used in various fields. It divides a collection of physical or abstract objects into classes according to some similarity criterion, so that objects in the same class are similar to one another. Clustering can reveal the global distribution pattern of the data and interesting correlations among object attributes.
In recent years, with the development of computer and communications technology, many industries have come to generate large volumes of stream data. Such data arrives at high speed, is unbounded in volume, changes dynamically, and is unpredictable. These features all constrain clustering on data streams. Many scholars have studied data stream clustering extensively, but considerable room for improvement remains.
Compared with other methods, clustering based on grid and density has particular advantages, such as high computing speed and the ability to find clusters of arbitrary shape, which make it well suited to data streams. The density threshold is a crucial parameter of grid- and density-based clustering algorithms and significantly affects clustering quality. However, ordinary users, lacking domain knowledge and prior information about the data, can hardly determine this parameter. In this thesis, the grid density threshold is determined by the average density method, through analysis of the grid densities of the initial data distribution. During data stream processing, the density threshold is adjusted dynamically to adapt to the changing nature of the data stream.
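The abstract does not give the exact formula for the average density method, so the following is only a minimal sketch under stated assumptions: points are mapped to uniform grid cells, a cell's density is its point count, and the threshold is the mean count over non-empty cells. The names `grid_cell`, `average_density_threshold`, and `cell_size` are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def grid_cell(point, cell_size):
    """Map a point to the integer coordinates of its (uniform) grid cell."""
    return tuple(int(x // cell_size) for x in point)


def average_density_threshold(points, cell_size):
    """Return (densities, threshold).

    Assumption: density = number of points in a cell, and the threshold
    is the mean density over non-empty cells. The thesis may define
    both quantities differently.
    """
    densities = Counter(grid_cell(p, cell_size) for p in points)
    threshold = sum(densities.values()) / len(densities)
    return densities, threshold


# Toy initial data: four points fall in cell (0, 0), one each in two other cells.
points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (0.4, 0.4), (2.5, 2.5), (5.1, 5.2)]
densities, threshold = average_density_threshold(points, cell_size=1.0)

# Cells at or above the average density are treated as dense.
dense_cells = {cell for cell, n in densities.items() if n >= threshold}
print(threshold, dense_cells)
```

One simple way to adjust the threshold dynamically, in the same spirit, would be to update the cell counts as new points arrive and recompute the average periodically. Whether the thesis does this, or weights older points differently, is not stated in the abstract.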
A common problem in grid-based clustering is that cluster boundaries are difficult to locate precisely, because grid-based methods discard the original data and operate only on grids. To improve boundary accuracy, this thesis retains a moderate amount of information about the data and subdivides the grids on the boundary. In most grid-based clustering algorithms, cluster formation proceeds in a randomly generated order, which produces a large number of small, meaningless clusters. To solve this problem, the grid unit with the highest density is chosen as the starting point for forming a cluster, which helps recover the original structure of the cluster.
On the basis of previous research, a data stream clustering algorithm is proposed that improves on the D-Stream algorithm. Experiments on artificial and real data show that the algorithm achieves good clustering quality.
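The idea of seeding each cluster from the densest grid unit can be illustrated with a short sketch. It is not the thesis's algorithm: it assumes a plain dictionary of cell densities, a fixed threshold, and face-adjacent neighbours, and it omits boundary subdivision entirely. The names `neighbors` and `form_clusters` are hypothetical.

```python
from collections import deque


def neighbors(cell):
    """Yield cells that differ by 1 along exactly one axis (face-adjacent)."""
    for axis in range(len(cell)):
        for step in (-1, 1):
            yield cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]


def form_clusters(densities, threshold):
    """Label dense cells, seeding each cluster from the densest unlabeled cell."""
    dense = {c for c, n in densities.items() if n >= threshold}
    # Visit candidate seeds from highest to lowest density.
    order = sorted(dense, key=lambda c: densities[c], reverse=True)
    labels = {}
    next_label = 0
    for seed in order:
        if seed in labels:
            continue  # already absorbed into an earlier cluster
        labels[seed] = next_label
        queue = deque([seed])
        while queue:  # breadth-first growth over adjacent dense cells
            cell = queue.popleft()
            for nb in neighbors(cell):
                if nb in dense and nb not in labels:
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return labels


# Toy densities: two groups of adjacent dense cells and one sparse cell.
densities = {(0, 0): 9, (0, 1): 5, (1, 1): 4, (5, 5): 7, (5, 6): 3, (9, 9): 1}
labels = form_clusters(densities, threshold=3)
print(labels)
```

In this simplified flood fill the seed order mainly determines how clusters are numbered, since connected dense cells end up together regardless of where growth starts. The benefit described in the abstract presumably depends on details of the full algorithm that are not given here.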
Keywords/Search Tags:Data Mining, Clustering Analysis, Data Stream, Density Grid, Non-uniform Division Grid