Research On K-means Clustering Algorithm Based On Coresets

Posted on:2022-07-10

Degree:Master

Type:Thesis

Country:China

Candidate:D W Xue

Full Text:PDF

GTID:2518306569993379

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

The core of the clustering analysis in mass data is to improve the execution efficiency and scalability of the algorithm while minimizing the loss of clustering quality.Nowadays optimizing the time complexity of the algorithm or computing in parallel are two main directions for clustering analysis in mass data.However,many current research does not take into account the limitation of the data accesses or the cost in data transmission.Therefore,this paper is to study the k-means clustering algorithm based on the coresets,which could deal with the clustering problem of very large-scale datasets by compressing data through sampling.For a static dataset without weight,we give a construction algorithm of coresets based on a hash function and prove error bounds.Through experiments and analysis,the construction algorithm in this article not only optimizes the time complexity and quantization error but also eliminates the failure probability of constructing coresets compared with the state-of-the-art coreset construction algorithm.For the data stream with positive weights,we first extend the coreset construction algorithm based on the hash function to positive weights and give the error bounds,and then give the merging theorem and single-layer merging algorithm of coresets.We verify the advantage of using centroid instead of random sampling through experiments.Taking into account the size issue of the single-layer coreset,we give a multi-layer merging algorithm by using the coreset tree and demonstrate two update methods.According to experiments,the k-means error generated by coreset tree has obvious relationship with the height of the tree and has no relationship with the width.To optimize the nearest neighbor search in the k-means algorithm,we give the approximate query algorithm of the KD-tree and the k-means optimization algorithm based on KD-tree.Experiments have proved that when the number of KD-trees is equal to the dimension of the dataset,this optimization algorithm has comparable clustering quality with the normal k-means algorithm and reduces the time complexity from O(dnk)to O(dn log k).We extend the data stream from positive weights to arbitrary weights,giving a strict definition of legal data stream and relative scenario description,and then give the construction algorithm and merging algorithm of coresets.Then we present the kmeans clustering algorithm framework,which combine the coresets with KD-tree,for data streams with arbitrary weights.Experiments show that the framework,which has high clustering quality and acceleration effects,could delete data points and solve this deletion problem that many coresets cannot complete.

Keywords/Search Tags:

clustering analysis, k-means, coresets, KD-tree, data stream

PDF Full Text Request

Related items

1	The Research Of Grid-based Parallel Clustering Algorithm And Clustering For Data Stream
2	A Density-Based Clustering Algorithm Over Stream Data
3	Research On Grid-based MST Data Stream Clustering Algorithm
4	Study Of Real Time Data Stream Clustering Based On Damped Window And Pruning Dimension Tree
5	Analysis And Optimization Of Fuzzy Clustering Algorithm Based On Data Stream
6	A High Dimensional Data Stream Clustering Algorithm Of Quick Dimension Reduction
7	Research On Fuzzy Clustering Algorithm For Data Stream
8	Data Stream Clustering Algorithm Analysis Using A Flock Of Agents
9	Research On Clustering DLIS-R Tree Algorithm Based On Spatial Data
10	Research On Data Stream Clustering Algorithm Based On Sliding Windows And Subspace Partition