
The Research Of The K-means Clustering Algorithm Based On Nearest Neighbors

Posted on: 2021-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Li
Full Text: PDF
GTID: 2518306095975579
Subject: Computer Science and Technology
Abstract/Summary:
As one of the most important tasks in data mining, clustering has long received close attention and has been widely applied in fields such as business intelligence, image pattern recognition, and web search. The K-means algorithm is a classic partitioning clustering algorithm widely used in daily life and production practice. However, K-means still has several drawbacks: the number of clusters is difficult to determine, the algorithm is sensitive to noisy data and to the initial cluster centers, and the time cost of distance computation is high when the amount of data is large. This thesis studies these problems in depth and proposes corresponding improvement strategies. The main research contents are as follows:

(1) To address the sensitivity of K-means to the initial cluster centers, a novel algorithm for initial center selection is given. The algorithm starts from the observation that a well-formed cluster should have high density within a small radius. By computing the nearest neighbors of each data point, the distance and density of each point are obtained, and a probability function reflecting how likely a point is to be an initial center is constructed; the points selected by this function become the initial cluster centers. Experimental results show that the initial centers obtained by this method are of higher quality than those produced by comparable algorithms.

(2) To address the high time cost of distance computation in K-means, a clustering algorithm based on the influence space is given. The algorithm first preprocesses the data set by computing the nearest neighbors and reverse nearest neighbors of each data point, i.e., its influence space, to obtain a reduced data set; K-means is then run on this reduced set, effectively decreasing the amount of data involved in the computation and improving clustering efficiency.

(3) Boundary data are difficult to cluster accurately, especially boundary spectra that fall between two spectral classes. To address this, a new method for clustering boundary data is given. Because spectral data are high-dimensional and massive, the spectra are first normalized so that all dimensions can be processed under the same standard. To reduce the amount of computation, the data involved in the operation are reduced using the influence space from (2). In addition, since random selection of the initial cluster centers in K-means affects the final clustering of boundary spectra, the centers are instead determined by the method in (1), which avoids this problem to a certain extent. Finally, K-means is run with the determined centers on the normalized data set to obtain the final clustering result. Experiments applying this method to stellar spectral data show that it handles the boundary-spectrum clustering problem well, verify its validity, and provide strong support for further research on the formation and evolution of the universe.
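The abstract describes contributions (1) and (2) only at a high level; the following is a minimal Python sketch of the two building blocks, under stated assumptions that are not from the thesis itself: k-nearest-neighbor distances define local density, a density-peaks-style score (density times distance to the nearest denser point) stands in for the thesis' probability function, and points with few reverse nearest neighbors are the ones pruned from the influence space.

```python
import numpy as np

def pairwise_dists(X):
    """Full Euclidean distance matrix with the diagonal masked out
    (fine for small data; a k-d tree would be used at scale)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d

def influence_space_reduce(X, k=5, min_rnn=1):
    """Contribution (2), sketched: a point's influence space combines its
    k nearest neighbors with its reverse k nearest neighbors (the points
    that count it among their own k nearest). Points with very few reverse
    neighbors lie in sparse regions and are dropped, shrinking the data
    K-means must process. The thesis' exact reduction rule is not given
    in the abstract."""
    d = pairwise_dists(X)
    knn = np.argsort(d, axis=1)[:, :k]
    # Reverse-neighbor count: how many points list i among their k nearest.
    rnn_count = np.bincount(knn.ravel(), minlength=len(X))
    return X[rnn_count >= min_rnn]

def select_initial_centers(X, n_clusters, k=10):
    """Contribution (1), sketched: score each point by local density
    (inverse mean distance to its k nearest neighbors) times separation
    (distance to the nearest denser point), and take the top-K scorers
    as initial centers -- a stand-in for the thesis' probability function."""
    d = pairwise_dists(X)
    knn_d = np.sort(d, axis=1)[:, :k]
    density = 1.0 / (knn_d.mean(axis=1) + 1e-12)
    sep = np.empty(len(X))
    for i in range(len(X)):
        denser = density > density[i]
        # The globally densest point gets the largest finite distance.
        sep[i] = d[i, denser].min() if denser.any() else d[i][np.isfinite(d[i])].max()
    score = density * sep
    return X[np.argsort(score)[::-1][:n_clusters]]
```

With these pieces, the pipeline of contribution (3) would normalize each spectrum, reduce the data with `influence_space_reduce`, pick centers with `select_initial_centers`, and run standard K-means from those centers.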
Keywords/Search Tags:Clustering, Data mining, Initial cluster center, Stellar spectrum, Influence space, Boundary data