Research On Non-IID K-Medoids Clustering Algorithm

Posted on: 2020-10-10, Degree: Master, Type: Thesis
Country: China, Candidate: B Han, Full Text: PDF
GTID: 2428330575487996, Subject: Computer application technology
Abstract/Summary:
With the continuing development of technology, data mining has become an important means of extracting useful information from large amounts of data, and cluster analysis, an important branch of data mining, is receiving more and more attention. The K-medoids algorithm is one of the representative algorithms in cluster analysis. It is more robust than the K-means algorithm, which is sensitive to isolated points (outliers). However, the K-medoids algorithm still has defects. First, its similarity measure is usually a distance measure, which assumes that data objects and attributes are independent and identically distributed (IID). In practice, data objects and attributes are generally non-IID, so the similarity measure needs to be improved. Second, the K-medoids algorithm has high time complexity, which makes the selection of the initial medoids especially important. To improve both the clustering quality and the running efficiency of the algorithm, this thesis makes the following improvements.

To address the IID assumption behind the similarity measure of the K-medoids algorithm, this thesis introduces the coupled similarity calculation method for nominal data from unsupervised learning, replacing the traditional Euclidean-distance similarity with a non-IID similarity formula. Because this formula is computed mainly from the frequencies of attribute values, and numerical data are not sensitive to frequency, numerical data are first clustered attribute column by attribute column according to Euclidean distance and replaced by the result of that clustering before the formula is applied. The NI-PAM algorithm is designed on this basis and yields better clustering results.

To address the defect of randomly selecting the initial medoids in the NI-PAM algorithm, this thesis uses a neighborhood radius to optimize the choice. A similarity matrix is built from the non-IID similarities between data objects. The matrix is traversed and, for each data object, the number of other data objects contained in its neighborhood is counted. The object whose neighborhood contains the most objects is selected as the first initial medoid. The similarities of the objects that fall within the neighborhood radius of that object are then set to zero in the similarity matrix, the matrix is traversed again, and so on until k medoids have been selected. This optimization improves the computational efficiency of the NI-PAM algorithm.

The improvements above raise the accuracy of the algorithm, and optimizing the initial medoids shortens the running time of the NI-PAM algorithm. However, because the introduced formula is expensive to compute, the time efficiency still needs to be improved. This thesis therefore introduces another, previously proposed coupled similarity formula for numerical data and replaces the Pearson correlation coefficient in it with the Spearman rank correlation coefficient. The N-NI-PAM algorithm is designed accordingly. Experiments show that it also greatly improves accuracy while greatly reducing running time.

The improved algorithms were verified on UCI data. The experimental results show that the NI-PAM and N-NI-PAM algorithms are considerably more accurate than the K-medoids algorithm with Euclidean distance, and that the N-NI-PAM algorithm has the better computational efficiency.
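The neighborhood-based selection of initial medoids described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function name `select_initial_medoids`, the parameter `threshold`, and the reading of "neighborhood" as "all other objects whose similarity is at least `threshold`" are assumptions made for illustration. The similarity matrix is taken as given, because the abstract does not state the non-IID similarity formula.

```python
import numpy as np


def select_initial_medoids(sim, k, threshold):
    """Pick k initial medoids from a symmetric similarity matrix.

    Assumption: object j lies in the neighborhood of object i when
    sim[i, j] >= threshold (higher values mean more similar).
    """
    if threshold <= 0:
        # Zeroed entries mark covered objects, so they must fall below the threshold.
        raise ValueError("threshold must be positive")
    sim = np.array(sim, dtype=float)   # work on a copy of the input
    np.fill_diagonal(sim, 0.0)         # an object is not its own neighbor
    available = np.ones(sim.shape[0], dtype=bool)
    medoids = []
    for _ in range(k):
        if not available.any():
            raise ValueError("every object is already covered; raise threshold or lower k")
        # Count, for each object, how many other objects lie in its neighborhood.
        neighbor_counts = (sim >= threshold).sum(axis=1)
        # Objects already chosen or covered can no longer become medoids.
        neighbor_counts = np.where(available, neighbor_counts, -1)
        best = int(np.argmax(neighbor_counts))
        medoids.append(best)
        # Zero the similarities of the new medoid and of everything in its neighborhood.
        covered = np.append(np.flatnonzero(sim[best] >= threshold), best)
        sim[covered, :] = 0.0
        sim[:, covered] = 0.0
        available[covered] = False
    return medoids
```

Calling `select_initial_medoids(sim, k=3, threshold=0.5)` on an n-by-n similarity matrix returns a list of row indices that can seed a PAM-style swap phase. Zeroing both the rows and the columns of covered objects is one way to read "the similarity between the objects included in the neighborhood radius is zeroed"; the thesis may do this differently.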
Keywords/Search Tags: K-medoids algorithm, Non-IIDness, Similarity measure, Initial medoids, Neighborhood radius