Font Size: a A A

Research On Fast Clustering Algorithm For High Dimensional Big Data

Posted on:2019-09-21Degree:MasterType:Thesis
Country:ChinaCandidate:C H DengFull Text:PDF
GTID:2428330545997818Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Clustering analysis is one of the fundamental tasks in data mining,computer vision,pattern recognition and machine learning.It plays a key role in many tasks,such as data preprocessing,outliers detection and codebook learning,etc.The aim of clustering is to partition input data samples into groups.Such that samples within the same group are closer to other than the samples from different groups.Since the proposal of k-means in the early 1980s,many clustering algorithms appear one after another.Despite great efforts have been taken,there is still lack of effective solutions,k-means remains the most popular clustering method for its simplicity and versatility.However,k-means still suffers from its low speed efficiency when both k and the scale of input samples are very large.In this thesis,a fast k-means variant is proposed based on k-nearest neighbor graph.We show that fast k-means clustering is achievable with the support of an approx-imate k-NN graph.Specifically,in the k-means iteration,one sample only needs to compare with clusters in which its nearest neighbors reside.The clustering complex-ity is therefore irrelevant to the cluster number.As shown in the paper,hundreds to thousands times speed-up is achieved in particular in the case that both n and k are very large.On the other hand,since the fast k-means is built upon k-means#,it also shows very high clustering quality.Overall,the proposed method shows considerably better trade-off between clustering quality and efficiency over existing solutions.Moreover,the beauty of this algorithm also lies in the design of fast k-NN graph construction process.The k-NN graph is built by calling the clustering algo-rithm itself in an intertwined evolving process.In the process,the k-NN graph and k-means clustering are incrementally optimized.Additionally,inviewing that k-means is unable to detect clusters in arbitrary shape,a clustering method called boundary erosion is proposed.In this approach,the clustering is decomposed into two steps.In the first step,the latent structures in the input samples are unspined by a boundary erosion process.The erosion produces a level set for the samples.In the second step,the clusters are formed by cluster expansion based on the level set with the guidance of a k-nearest neighbor graph.This algorithm is simple,generic and is able to identif:y clusters in arbitrary shapes.The major contribution of this thesis are summarized as follows.Fast k-means based on KNN graph was proposed,which accelerate k-means by hundred times while maintaining high clustering quality.Moreover,the time complexity of k-means has been reduced from O(n·k·d)to O(n·log(n)·d),which means clustering time is irrelevant to k.It takes only 5.2 hours to clustering 10M data with 512-dimensional to 1M clusters.In contrast,to fulfill the same scale of clustering,it would take 3 years for traditional k-means.· A fast KNN graph construction method based on clustering was proposed,which is used to support fast k-means that proposed by this thesis.· A new clustering method was proposed,namely clustering via boundary erosion,which is able to discover arbitrary shape.As demonstrated across various clustering tasks,it is able to outperform most of the state-of-the-art algorithms and its performance is nearly perfect in some scenarios.Moreover,it is very fast in large scale.
Keywords/Search Tags:clustering, k-means, k-NN graph
PDF Full Text Request
Related items