Research On High Dimensional Data Clustering Algorithm Based On Subspace And Density Peak

Posted on: 2019-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: W Tan
GTID: 2428330572995097
Subject: Software engineering
Abstract/Summary:
With the rapid development of computer technology, the explosive growth of data makes it increasingly difficult to find valuable information. Traditional methods cluster low-dimensional data sets well, but because of the "curse of dimensionality" they do not achieve the desired results on high-dimensional data sets. More comprehensive clustering methods are therefore urgently needed. This thesis studies clustering algorithms applicable to high-dimensional data sets. It first introduces the background and significance of high-dimensional data clustering in data mining and the state of clustering-algorithm research in China and abroad, then reviews the relevant background on clustering. Based on an extensive literature review, it proposes several improvements to existing algorithms. The main work is as follows:

(1) The advantages and disadvantages of existing clustering algorithms are summarized, with a focus on the CLIQUE and DPC algorithms. Equal-width grid partitioning in CLIQUE may lose some cluster points and break up dense regions, and the density threshold is supplied manually and is essentially arbitrary, so an appropriate value is hard to determine. DPC can only handle small and medium-sized data sets and cannot distinguish outliers from cluster boundary points.

(2) An adaptive high-dimensional subspace clustering algorithm, REG-CLIQUE, is proposed. A binary tree is combined with relative entropy to perform adaptive grid partitioning, remove redundant dimensions, and improve clustering accuracy. A formula for the density threshold is proposed and a suitable value is obtained recursively, which greatly reduces the algorithm's dependence on prior knowledge. Results show that REG-CLIQUE achieves adaptive clustering and outperforms both GP-CLIQUE and CLIQUE in clustering time and accuracy.

(3) An improved density peak clustering algorithm, SREDPC, is proposed. High-dimensional large data sets are sampled. Residual squares are used to provide a better decision graph than that of the DPC algorithm for determining cluster centers, and halo recognition distinguishes outliers from boundary points that belong to clusters. Results show that the improved algorithm can be applied to high-dimensional large data sets and outperforms the original DPC algorithm in both time complexity and clustering results.
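For readers unfamiliar with the baseline that SREDPC improves on, the decision graph of the original DPC algorithm can be sketched as follows. This is only an illustration of standard density peak clustering, not the thesis's SREDPC method: the function name `dpc_decision_values`, the parameter `cutoff`, the toy data, and the use of a simple cutoff-count density are all assumptions made here. Each point gets a local density `rho` (number of neighbours within `cutoff`) and a distance `delta` to the nearest point of higher density; points where both are large are candidate cluster centers.

```python
import numpy as np

def dpc_decision_values(points, cutoff):
    """Return (rho, delta) per point for a baseline DPC decision graph.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n). This is O(n^2) in time
    # and memory, which is why plain DPC suits small and medium data sets.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Local density: count of other points closer than the cutoff
    # (subtract 1 to exclude the point itself, whose distance is 0).
    rho = (dist < cutoff).sum(axis=1) - 1
    n = len(points)
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        if higher.any():
            # Distance to the nearest point with strictly higher density.
            delta[i] = dist[i, higher].min()
        else:
            # Convention for a densest point: use its largest distance.
            delta[i] = dist[i].max()
    return rho, delta

# Toy data (assumed): two small, well-separated groups in 2-D.
toy = [[0, 0], [0, 1], [1, 0], [0.4, 0.4],
       [10, 10], [10, 11], [11, 10], [10.4, 10.4]]
rho, delta = dpc_decision_values(toy, cutoff=0.8)
# Rank points by rho * delta; the highest-ranked are candidate centers.
candidate_centers = np.argsort(rho * delta)[-2:]
```

Plotting `delta` against `rho` gives the decision graph; the pairwise distance matrix in this sketch also shows why sampling becomes attractive for large data sets.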
Keywords/Search Tags: Big data, Clustering, Subspace, Adaptive, Density peak