Research On Fast Clustering Algorithm For High Dimensional Big Data

Posted on:2019-09-21

Degree:Master

Type:Thesis

Country:China

Candidate:C H Deng

Full Text:PDF

GTID:2428330545997818

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Clustering analysis is one of the fundamental tasks in data mining,computer vision,pattern recognition and machine learning.It plays a key role in many tasks,such as data preprocessing,outliers detection and codebook learning,etc.The aim of clustering is to partition input data samples into groups.Such that samples within the same group are closer to other than the samples from different groups.Since the proposal of k-means in the early 1980s,many clustering algorithms appear one after another.Despite great efforts have been taken,there is still lack of effective solutions,k-means remains the most popular clustering method for its simplicity and versatility.However,k-means still suffers from its low speed efficiency when both k and the scale of input samples are very large.In this thesis,a fast k-means variant is proposed based on k-nearest neighbor graph.We show that fast k-means clustering is achievable with the support of an approx-imate k-NN graph.Specifically,in the k-means iteration,one sample only needs to compare with clusters in which its nearest neighbors reside.The clustering complex-ity is therefore irrelevant to the cluster number.As shown in the paper,hundreds to thousands times speed-up is achieved in particular in the case that both n and k are very large.On the other hand,since the fast k-means is built upon k-means#,it also shows very high clustering quality.Overall,the proposed method shows considerably better trade-off between clustering quality and efficiency over existing solutions.Moreover,the beauty of this algorithm also lies in the design of fast k-NN graph construction process.The k-NN graph is built by calling the clustering algo-rithm itself in an intertwined evolving process.In the process,the k-NN graph and k-means clustering are incrementally optimized.Additionally,inviewing that k-means is unable to detect clusters in arbitrary shape,a clustering method called boundary erosion is proposed.In this approach,the clustering is decomposed into two steps.In the first step,the latent structures in the input samples are unspined by a boundary erosion process.The erosion produces a level set for the samples.In the second step,the clusters are formed by cluster expansion based on the level set with the guidance of a k-nearest neighbor graph.This algorithm is simple,generic and is able to identif:y clusters in arbitrary shapes.The major contribution of this thesis are summarized as follows.Fast k-means based on KNN graph was proposed,which accelerate k-means by hundred times while maintaining high clustering quality.Moreover,the time complexity of k-means has been reduced from O(n�k�d)to O(n�log(n)�d),which means clustering time is irrelevant to k.It takes only 5.2 hours to clustering 10M data with 512-dimensional to 1M clusters.In contrast,to fulfill the same scale of clustering,it would take 3 years for traditional k-means.� A fast KNN graph construction method based on clustering was proposed,which is used to support fast k-means that proposed by this thesis.� A new clustering method was proposed,namely clustering via boundary erosion,which is able to discover arbitrary shape.As demonstrated across various clustering tasks,it is able to outperform most of the state-of-the-art algorithms and its performance is nearly perfect in some scenarios.Moreover,it is very fast in large scale.

Keywords/Search Tags:

clustering, k-means, k-NN graph

PDF Full Text Request

Related items

1	Research On The K-means Clustering Algorithm Based On Attribute Augumented Graph
2	Robust Adaptive Graph-Regularization Clustering Algorithm
3	A Deep Embedding Clustering Algorithm Considering Preservation Of Initial Clustering Structure And Its Application
4	Fuzzy C-means And K-means Clustering Algorithm And Its Parallel
5	Research On Fast Clustering Algorithm For High Dimensional Big Data
6	Research On Multidimensional Data And Software Clustering Based On Graph Clustering
7	Based On K-means The Chinese Text Clustering Algorithm
8	Research And Application Of Clustering Algorithm Based On Knowledge Graph In Text Clustering
9	Study Of Chinese Text Clustering On Improved K-means Algorithm
10	The Application Of Fuzzy C-means Clustering In The Stock Investment