| Unsupervised machine learning is one of the most widely used and well-developed fields in modern computer science and technology. Its goal is to discover patterns, structures, and regularities in unlabeled data, helping people understand the data better and uncover the information hidden in it. Clustering algorithms in particular face problems such as parameter selection, the choice of similarity measure, and sensitivity to the characteristics of the data distribution; in the context of big data, algorithms of high complexity also need to be made faster. This paper addresses these issues through an in-depth study of how to represent and apply the hidden features of samples by means of data structures. The research is divided into three main parts. (1) A graph construction method based on local density trends. First, a confluence tree is defined to depict the density trend of the data nodes. Second, to realize the confluence tree as a data structure, a method for constructing branches based on decreasing path length is presented, in which each branch reflects the local density trend. On this basis, a simple method for constructing a sparse graph of local density trends (GLDT) from confluence trees is designed using the idea of hierarchical iteration. Finally, the correctness of the method is established by theoretical proof. (2) A clustering method based on GLDT graph construction. First, the density factor of a confluence tree is defined from its path length: the smaller the factor, the denser the confluence tree. Second, a connection method based on breadth-first search over density-nearest neighbors is given, in which a connected confluence tree is required to satisfy a value below a threshold, while the mean value associated with each connected subgraph is determined by experiment and experience. On this basis, a clustering method based on GLDT graph construction is designed, and its correctness is verified experimentally on synthetic data and on UCI data sets with different characteristics. The algorithm is insensitive to its parameters and has advantages over similar algorithms when clustering non-convex data. (3) A hierarchical sampling method based on GLDT graph construction. For each connected subgraph, the root node of its tree structure is regarded as the point of locally maximal density in that subgraph, so only this root node is sampled. By performing the graph construction recursively and sampling only the root nodes each time, high-density points are obtained, and the final sampling ratio can safely reach 1%. Combining this method with spectral clustering significantly improves time efficiency and also improves the clustering accuracy of spectral clustering. |
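
The root-sampling idea in part (3) can be illustrated with a minimal sketch. This is not the GLDT or confluence-tree construction, whose definitions are not given in this abstract. It assumes a simplified stand-in: the mean distance to the k nearest neighbors serves as a density proxy (smaller means denser), and a point counts as a root when none of its k nearest neighbors is denser. The function names `knn_scores`, `sample_roots`, and `hierarchical_sample` and the parameters `k` and `rounds` are hypothetical.

```python
import math


def knn_scores(points, k):
    """Return (neighbors, scores); scores[i] is the mean distance from
    point i to its k nearest neighbors (an assumed density proxy:
    a smaller score means a denser neighborhood)."""
    n = len(points)
    neighbors, scores = [], []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        nearest = dists[:k]
        neighbors.append([j for _, j in nearest])
        scores.append(sum(d for d, _ in nearest) / len(nearest) if nearest else 0.0)
    return neighbors, scores


def sample_roots(points, k):
    """Return indices of points with no denser point among their k nearest
    neighbors. Ties are broken by index so that equal scores cannot make
    two points each treat the other as the denser one."""
    neighbors, scores = knn_scores(points, k)
    roots = []
    for i in range(len(points)):
        has_denser = any(
            (scores[j], j) < (scores[i], i) for j in neighbors[i]
        )
        if not has_denser:
            roots.append(i)
    return roots


def hierarchical_sample(points, k, rounds):
    """Repeat root sampling on the surviving points for up to `rounds`
    passes and return the indices (into `points`) that remain."""
    idx = list(range(len(points)))
    for _ in range(rounds):
        if len(idx) <= 1:
            break
        sub = [points[i] for i in idx]
        roots = sample_roots(sub, min(k, len(sub) - 1))
        idx = [idx[r] for r in roots]
    return idx
```

On two well-separated groups of points, one pass keeps a single representative per group, the point whose neighbors are closest on average. Each further pass thins the sample again, which is how the sampling ratio shrinks across rounds. The brute-force neighbor search here is quadratic in the number of points and is chosen only for brevity.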