Research On Data Clustering Based On Grid

Posted on: 2018-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: F L Cai
Full Text: PDF
GTID: 2348330542487343
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, such as radio-frequency identification (RFID), storage technology, and hardware and software technology, people can collect large amounts of data that traditional methods cannot manage and analyze in a short time. Data mining is a powerful technique for analyzing and processing such large data. Compared with traditional data sets, which are stored in memory and can be processed many times, the data stream, as a new dynamic structure, has become a research hotspot in recent years. In the field of data mining, grid-based clustering algorithms are widely used on both static data sets and data streams because of characteristics such as fast processing speed, good scalability, insensitivity to input order and noise, and the ability to be parallelized and updated incrementally. However, grid-based clustering algorithms often have poor clustering accuracy because of the hard partitioning of the grid and manually set thresholds, which is a disadvantage when clustering multi-density data sets.

To address these shortcomings of grid-based clustering, this thesis proposes two methods that improve the clustering quality of grid-based multi-density clustering and grid-based data stream clustering, respectively. Multi-density clustering based on grid similarity and moving cells (MCGM) applies grid clustering to static data sets. We propose a grid-similarity function that considers grid density and the distance between the centroids of neighboring grids. For a boundary grid that does not satisfy the similarity threshold, we move the grid cell according to its centroid and count the data points in each intersection with its neighboring grids. The neighbor whose intersection contains the most data points is chosen, and the two grids are assigned to the same cluster. In this way, multi-density clusters can be found more efficiently and the clustering accuracy can be improved.

The data stream clustering algorithm based on grid dual centroids (GDCD-Stream) applies grid clustering to data streams. The algorithm adopts a two-phase processing framework that divides the clustering process into two stages: online processing and offline clustering. During online processing, a grid feature vector is designed and updated; it describes the grid data from several aspects, such as the density of the data points in the grid and the data distribution within the grid. An adaptive grid threshold is also introduced: the thresholds are set according to the real-time data distribution in the data space, and the detection period for offline clustering is calculated from the adaptive grid threshold value. During offline clustering, each transition grid cell is dichotomized into two sub-grid cells, and the grid similarity function is used to assign each sub-grid cell to a cluster. This process improves the clustering accuracy.

Simulation results show that the MCGM algorithm detects multi-density clusters effectively and does not cause multiple clusters to merge; it has high clustering precision and high practicability. Simulation results also show that the proposed GDCD-Stream algorithm improves the accuracy of grid-based data stream clustering at the cost of additional memory, and that it can adjust the grid density thresholds as the data density in the stream changes. With the strategies described above, the quality of clustering can be improved significantly.
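
The grid-similarity idea described in the abstract (combining grid density with the distance between the centroids of neighboring grids) can be illustrated with a minimal sketch. The abstract does not state the actual formula, so everything below is an assumption made for illustration: the `GridCell` class, the `grid_similarity` function, and the choice to multiply a density ratio by a normalized centroid closeness are hypothetical and are not the thesis's definition.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GridCell:
    """Hypothetical summary of one grid cell (not the thesis's feature vector)."""
    count: int                    # number of data points that fell into the cell
    centroid: Tuple[float, ...]   # mean position of those points


def grid_similarity(a: GridCell, b: GridCell, side_length: float) -> float:
    """Illustrative similarity in [0, 1] between two neighboring cells.

    Assumed form: (density ratio) * (centroid closeness). Both factors are
    stand-ins chosen for this sketch, not taken from the thesis.
    """
    if a.count == 0 or b.count == 0:
        return 0.0  # an empty cell is treated as dissimilar to everything

    # Cells of equal size: the ratio of counts equals the ratio of densities.
    density_ratio = min(a.count, b.count) / max(a.count, b.count)

    # Two points in adjacent cells of a uniform grid differ by at most
    # 2 * side_length per dimension, which bounds the centroid distance.
    dims = len(a.centroid)
    max_dist = 2.0 * side_length * math.sqrt(dims)
    dist = math.dist(a.centroid, b.centroid)
    closeness = max(0.0, 1.0 - dist / max_dist)

    return density_ratio * closeness


if __name__ == "__main__":
    dense = GridCell(count=10, centroid=(0.5, 0.5))
    sparse = GridCell(count=5, centroid=(1.5, 0.5))
    print(grid_similarity(dense, sparse, side_length=1.0))
```

In a scheme like this, a pair of neighboring cells would be merged into one cluster when the similarity exceeds a threshold; cells that fall below it correspond to the boundary grids that, according to the abstract, are handled by moving the cell toward its centroid.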
Keywords/Search Tags: Cluster, Grid, Data Stream, Grid Similarity, Multi-density