
The Research Of The K-means Clustering Algorithm Based On Nearest Neighbors

Posted on: 2021-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Li
Full Text: PDF
GTID: 2518306095975579
Subject: Computer Science and Technology
Abstract/Summary:
As one of the most important tasks in data mining, clustering has long received close attention and has been widely applied in fields such as business intelligence, image pattern recognition, and web search. The K-means algorithm is a classic partitioning clustering algorithm widely used in daily life and production practice. However, K-means still has several drawbacks: the number of clusters is difficult to determine, the algorithm is sensitive to noisy data and to the initial cluster centers, and the time cost of distance computation is high when the amount of data is large. This thesis studies these problems in depth and proposes corresponding improvement strategies. The main research contents are as follows:

(1) To address the sensitivity of K-means to the initial cluster centers, a novel algorithm for initial center selection is given. The algorithm starts from the observation that a well-formed cluster should have high density within a small radius. By computing the nearest neighbors of each data point, the distance and density of each point are obtained, and a probability function reflecting how likely a point is to be an initial center is constructed; the points selected by this function become the initial cluster centers. Experimental results show that the initial centers obtained by this method are of higher quality than those produced by comparable algorithms.

(2) To address the high time cost of distance computation in K-means, a clustering algorithm based on the influence space is given. The algorithm first preprocesses the data set by computing the nearest neighbors and reverse nearest neighbors of each data point, i.e., its influence space, to obtain a reduced data set; K-means is then run on this reduced set, effectively decreasing the amount of data involved in the computation and improving clustering efficiency.

(3) Boundary data are difficult to cluster accurately, especially boundary spectra that fall between two spectral classes. To address this, a new method for clustering boundary data is given. Because spectral data are high-dimensional and massive, the spectra are first normalized so that all dimensions can be processed under the same standard. To reduce the amount of computation, the data involved in the operation are reduced using the influence space from (2). In addition, since random selection of the initial cluster centers in K-means affects the final clustering of boundary spectra, the centers are instead determined by the method in (1), which avoids this problem to a certain extent. Finally, K-means is run with the determined centers on the normalized data set to obtain the final clustering result. Experiments applying this method to stellar spectral data show that it handles the boundary-spectrum clustering problem well, verify its validity, and provide strong support for further research on the formation and evolution of the universe.
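The abstract describes contributions (1) and (2) only at a high level; the following is a minimal Python sketch of the two building blocks, under stated assumptions that are not from the thesis itself: k-nearest-neighbor distances define local density, a density-peaks-style score (density times distance to the nearest denser point) stands in for the thesis' probability function, and points with few reverse nearest neighbors are the ones pruned from the influence space.

```python
import numpy as np

def pairwise_dists(X):
    """Full Euclidean distance matrix with the diagonal masked out
    (fine for small data; a k-d tree would be used at scale)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d

def influence_space_reduce(X, k=5, min_rnn=1):
    """Contribution (2), sketched: a point's influence space combines its
    k nearest neighbors with its reverse k nearest neighbors (the points
    that count it among their own k nearest). Points with very few reverse
    neighbors lie in sparse regions and are dropped, shrinking the data
    K-means must process. The thesis' exact reduction rule is not given
    in the abstract."""
    d = pairwise_dists(X)
    knn = np.argsort(d, axis=1)[:, :k]
    # Reverse-neighbor count: how many points list i among their k nearest.
    rnn_count = np.bincount(knn.ravel(), minlength=len(X))
    return X[rnn_count >= min_rnn]

def select_initial_centers(X, n_clusters, k=10):
    """Contribution (1), sketched: score each point by local density
    (inverse mean distance to its k nearest neighbors) times separation
    (distance to the nearest denser point), and take the top-K scorers
    as initial centers -- a stand-in for the thesis' probability function."""
    d = pairwise_dists(X)
    knn_d = np.sort(d, axis=1)[:, :k]
    density = 1.0 / (knn_d.mean(axis=1) + 1e-12)
    sep = np.empty(len(X))
    for i in range(len(X)):
        denser = density > density[i]
        # The globally densest point gets the largest finite distance.
        sep[i] = d[i, denser].min() if denser.any() else d[i][np.isfinite(d[i])].max()
    score = density * sep
    return X[np.argsort(score)[::-1][:n_clusters]]
```

With these pieces, the pipeline of contribution (3) would normalize each spectrum, reduce the data with `influence_space_reduce`, pick centers with `select_initial_centers`, and run standard K-means from those centers.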
Keywords/Search Tags:Clustering, Data mining, Initial cluster center, Stellar spectrum, Influence space, Boundary data