Clustering is an important data mining technique that partitions unlabeled data into clusters through unsupervised learning. Grid clustering, which takes grid cells as its processing units, can cluster large datasets quickly, and its time complexity is independent of the number of data points; it has therefore attracted wide attention in recent years. However, as data volume and dimensionality increase, the number of non-empty grid cells rises sharply, which reduces the efficiency of grid clustering. In addition, real-world data often contains crossed clusters, and traditional grid clustering algorithms, which cluster by connectivity, are prone to merging multiple crossed clusters into a single cluster, which lowers clustering accuracy. Density-based clustering methods treat a cluster as a set of connected high-density data points and have a clear advantage in selecting core points and core positions. Therefore, distinguishing the logical positions of data points in space according to their densities, and overcoming the "grid disaster" problem caused by the growing number of grids and the uncertainty of cluster boundaries, are important issues. Studying density-based grid clustering methods suited to large-scale datasets with crossed clusters is thus of considerable research value, and exploring their practical industrial application is equally significant.

This dissertation addresses the clustering requirements of large-scale data with crossed clusters. It studies grid clustering methods based on density differentiation for such data and then explores the application of the proposed clustering algorithm to the methanol distillation process. The main contributions are as follows:

(1) Grid density peaks clustering algorithm based on the Zipf distribution. Grid density peak clustering, taking advantage of
the strong ability of the density peaks clustering algorithm to identify clusters of arbitrary shape, greatly reduces the computational cost by gridding the dataset. However, on large-scale datasets, computing the distance matrix over the non-empty grids is expensive, and both time and space complexity are high. Because the probability density distribution with grid density as the variable follows a Zipf-like distribution, a grid density peaks clustering algorithm based on the Zipf distribution is proposed. According to the Zipf distribution, the dense center grids are filtered and the cluster centers are determined heuristically. Experimental results show that the proposed algorithm has significant advantages in clustering large-scale data with crossed clusters.

(2) Grid DBSCAN clustering algorithm based on dynamic bitmap indexing and shared neighbor. The grid DBSCAN algorithm based on a cluster forest, which uses bitmap-like indexing for fast range queries of neighboring grids, has advantages in clustering high-dimensional data. However, its grid indexing and merging processes contain a significant amount of redundancy, and the low-density-first strategy used in grid merging may produce an incorrect number of clusters on large datasets with crossed clusters. To overcome the clustering errors caused by the density parameters, dynamic grid indexing and a high-density-first strategy are introduced on top of the bitmap-like indexing, and a grid DBSCAN clustering algorithm based on dynamic bitmap indexing and shared neighbor is proposed. It improves clustering efficiency on high-dimensional large datasets with crossed clusters.

(3) Application of the grid density peaks clustering algorithm based on the Zipf distribution to the methanol distillation process. The choice of optimal parameters in the methanol distillation process affects both the quality of the distillation and the production cost of the enterprise. The sensitivity analysis based on a greedy
strategy used in traditional parameter optimization can easily yield a locally optimal solution. Therefore, the grid density peaks clustering algorithm based on the Zipf distribution is applied to the data analysis of the methanol distillation process. The key parameters are identified through a comparative analysis of the different clustering results, and decision-making suggestions for optimizing the methanol distillation process are then given.
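
To make the idea behind contribution (1) more concrete, the following is a minimal sketch of generic grid density peaks clustering: points are binned into cells, each non-empty cell receives a density (its point count) and a distance to the nearest denser cell, and the cells with the largest product of the two are taken as cluster centers. It is an illustration under stated assumptions, not the algorithm proposed in the dissertation: the function name `grid_density_peaks` and the parameters `cell_size` and `n_centers` are hypothetical, the Zipf-based filtering of dense center grids is not reproduced, and the pairwise distance step over non-empty cells is exactly the cost that the dissertation aims to reduce.

```python
import numpy as np

def grid_density_peaks(points, cell_size, n_centers):
    """Toy grid density peaks clustering (illustrative sketch only)."""
    points = np.asarray(points, dtype=float)
    # Map every point to the integer coordinates of its grid cell.
    cells = np.floor(points / cell_size).astype(int)
    # Non-empty cells, each point's cell index, and cell densities (point counts).
    uniq, inverse, density = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    m = len(uniq)

    # Visit cells from densest to sparsest (ties broken by cell order).
    order = np.argsort(-density, kind="stable")
    delta = np.zeros(m)               # distance to the nearest denser cell
    parent = np.full(m, -1)           # index of that nearest denser cell
    for rank in range(1, m):
        i = order[rank]
        higher = order[:rank]
        d = np.linalg.norm(uniq[higher] - uniq[i], axis=1)
        j = int(np.argmin(d))
        delta[i] = d[j]
        parent[i] = higher[j]
    # The densest cell has no denser neighbor; force it to be a center.
    delta[order[0]] = np.inf

    # Cells with the largest density * delta become cluster centers.
    gamma = density * delta
    centers = np.argsort(-gamma, kind="stable")[:n_centers]
    cell_label = np.full(m, -1)
    cell_label[centers] = np.arange(len(centers))

    # Every other cell inherits the label of its nearest denser cell.
    for i in order:
        if cell_label[i] == -1:
            cell_label[i] = cell_label[parent[i]]
    return cell_label[inverse]        # one label per input point

# Usage on two synthetic, well-separated blobs (made-up data).
rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.3, size=(200, 2))
blob_b = rng.normal([5.0, 5.0], 0.3, size=(200, 2))
labels = grid_density_peaks(np.vstack([blob_a, blob_b]), cell_size=0.5, n_centers=2)
```

Because all distances are computed between non-empty cells rather than between points, the cost of this sketch depends on the number of non-empty cells, which is why that number matters for large, high-dimensional datasets.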