Font Size: a A A

Research On K-means Clustering Algorithm Based On Coresets

Posted on:2022-07-10Degree:MasterType:Thesis
Country:ChinaCandidate:D W XueFull Text:PDF
GTID:2518306569993379Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The core of the clustering analysis in mass data is to improve the execution efficiency and scalability of the algorithm while minimizing the loss of clustering quality.Nowadays optimizing the time complexity of the algorithm or computing in parallel are two main directions for clustering analysis in mass data.However,many current research does not take into account the limitation of the data accesses or the cost in data transmission.Therefore,this paper is to study the k-means clustering algorithm based on the coresets,which could deal with the clustering problem of very large-scale datasets by compressing data through sampling.For a static dataset without weight,we give a construction algorithm of coresets based on a hash function and prove error bounds.Through experiments and analysis,the construction algorithm in this article not only optimizes the time complexity and quantization error but also eliminates the failure probability of constructing coresets compared with the state-of-the-art coreset construction algorithm.For the data stream with positive weights,we first extend the coreset construction algorithm based on the hash function to positive weights and give the error bounds,and then give the merging theorem and single-layer merging algorithm of coresets.We verify the advantage of using centroid instead of random sampling through experiments.Taking into account the size issue of the single-layer coreset,we give a multi-layer merging algorithm by using the coreset tree and demonstrate two update methods.According to experiments,the k-means error generated by coreset tree has obvious relationship with the height of the tree and has no relationship with the width.To optimize the nearest neighbor search in the k-means algorithm,we give the approximate query algorithm of the KD-tree and the k-means optimization algorithm based on KD-tree.Experiments have proved that when the number of KD-trees is equal to the dimension of the dataset,this optimization algorithm has comparable clustering quality with the normal k-means algorithm and reduces the time complexity from O(dnk)to O(dn log k).We extend the data stream from positive weights to arbitrary weights,giving a strict definition of legal data stream and relative scenario description,and then give the construction algorithm and merging algorithm of coresets.Then we present the kmeans clustering algorithm framework,which combine the coresets with KD-tree,for data streams with arbitrary weights.Experiments show that the framework,which has high clustering quality and acceleration effects,could delete data points and solve this deletion problem that many coresets cannot complete.
Keywords/Search Tags:clustering analysis, k-means, coresets, KD-tree, data stream
PDF Full Text Request
Related items