Estimating the number of clusters is a critical problem in cluster analysis. Clustering aims to partition data into several categories, or clusters, such that the data within each cluster share similar features or attributes. Determining how many clusters to use, however, is difficult, and the choice strongly affects the clustering result. When the number of clusters is too large, the clusters are over-refined and the model over-fits, making the results too complex to interpret and use. When the number of clusters is too small, the result is too coarse and loses substantial information, so it cannot effectively reflect the intrinsic structure and characteristics of the data. Effective methods are therefore needed to estimate the number of clusters in a reasonable way.

At present, the commonly used methods for estimating the number of clusters include the elbow method and the silhouette coefficient. The elbow method is based on the intra-cluster variance: it computes the intra-cluster variance for different numbers of clusters and looks for an "inflection point" in the resulting curve, and the number of clusters at that point is taken as the most suitable one. The silhouette coefficient is based on intra-cluster cohesion and inter-cluster separation: it is computed for each data point and then averaged over all data points to evaluate the quality of a clustering result and to select the optimal number of clusters. These methods are not foolproof, however, because they depend on the quality of the clustering results and on the distribution of the data. In practical applications, the estimation method must also be chosen according to the specific problem and the characteristics of the data. For example, the curse of dimensionality must be considered when dealing with high-dimensional data, and different users may have different views on the optimal number of clusters because their data have different characteristics and serve different purposes. A reasonable estimate of the number of clusters helps us understand the data better, discover patterns and regularities in it, and support subsequent data analysis and applications.

To address these issues, this paper proposes a new estimation method and a system based on user requirements. The contributions of this paper are as follows:

(1) A method is proposed for estimating the number of clusters in high-dimensional, large-scale data sets. It combines dimensionality-reduction sampling with a logarithmic search to determine the optimal number of clusters. The method uses locality-sensitive hashing to reduce the dimensionality and then samples the resulting "buckets" to obtain a subset of the data, which reduces the computational complexity and improves the efficiency of the algorithm. In the logarithmic search stage, searching over the number of clusters on a logarithmic scale effectively narrows the search range and thus locates the optimal number of clusters.

(2) A system for estimating the number of clusters based on user requirements is proposed. The system automatically estimates the optimal number of clusters from the data set attributes and clustering purposes provided by the user, and visualizes the clustering results. The system is practical and scalable, and can meet the needs of different users.
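
To make the pipeline in contribution (1) concrete, the following is a minimal sketch, not the method as implemented in this paper. It rests on several assumptions: random-hyperplane hashing stands in for the locality-sensitive hashing step, a fixed fraction of each bucket is sampled, the average silhouette coefficient is used to score each candidate number of clusters, and "logarithmic search" is read as trying powers of two first and then refining between the neighbours of the best one. All function names and parameters (`lsh_bucket_sample`, `kmeans_labels`, `silhouette`, `log_search_k`, `n_bits`, `frac`, `k_max`) are hypothetical.

```python
import numpy as np

def lsh_bucket_sample(X, n_bits=4, frac=0.3, seed=0):
    """Hash rows with random hyperplanes and sample a fraction of each bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    bits = ((X - X.mean(axis=0)) @ planes > 0).astype(np.int64)
    keys = bits @ (1 << np.arange(n_bits))        # pack sign bits into a bucket id
    picked = []
    for key in np.unique(keys):
        idx = np.flatnonzero(keys == key)
        take = max(1, int(round(frac * idx.size)))  # keep at least one per bucket
        picked.append(rng.choice(idx, size=take, replace=False))
    return X[np.concatenate(picked)]

def kmeans_labels(X, k, n_iter=30, seed=0):
    """Plain Lloyd iterations with farthest-first initialisation (an assumption)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):               # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    ids = np.unique(labels)
    if ids.size < 2:
        return -1.0
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue
        a = D[i, own].sum() / (n_own - 1)         # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())

def log_search_k(X, k_max=16):
    """Score k = 2, 4, 8, ... first, then refine around the best power of two."""
    limit = min(k_max, len(X) - 1)
    scores = {}
    k = 2
    while k <= limit:
        scores[k] = silhouette(X, kmeans_labels(X, k))
        k *= 2
    best = max(scores, key=scores.get)
    for k in range(max(2, best // 2 + 1), min(2 * best - 1, limit) + 1):
        if k not in scores:
            scores[k] = silhouette(X, kmeans_labels(X, k))
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blob_centers = 20.0 * np.eye(3, 5)            # three well-separated synthetic blobs
    X = np.vstack([c + rng.normal(size=(100, 5)) for c in blob_centers])
    sample = lsh_bucket_sample(X)
    k_hat, _ = log_search_k(sample)
    print(len(sample), k_hat)
```

In this sketch the coarse pass costs a number of clusterings that grows with the logarithm of `k_max`, and only the interval around the best coarse value is scanned exhaustively. The refinement step is linear within that interval, which is a simplification chosen for clarity rather than something taken from the paper.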