Efficient And High Quality Flat Clustering Algorithms

Posted on:2021-08-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Ren

Full Text:PDF

GTID:2518306095451524

Subject:Software engineering

Abstract/Summary:

As a crucial data mining technique,clustering has been studied widely over the past few decades.The unsupervised nature of clustering is favored by researchers in the era of big data.Among all the clustering problems,the k-means problem and spectral clustering problem could be the two most well-known problems.However,both clustering problems are NP-hard.Hence,we could only use approximate algorithms on big data.How to improve the clustering quality and speed up the clustering procedure are two key problems of k-means and spectral clustering.This thesis will investigate these two problems on the k-means and spectral clustering in different chapters in the following ways.First,the thesis will define the k-means or spectral clustering in the corresponding chapters and introduce relevant backgrounds.Second,techniques for improving the clustering quality and speeding up will be introduced separately and described in details.Third,our contribution and innovation will be described and emphasized.After that,experiments will be carried out to verify our algorithms and theoretical results.Finally,the main results will be summarized and the future directions will be forecasted.Based on my research on the algorithms and theories of k-means and spectral clustering,relevant algorithms and theories are further improved.My major contributions are as follows:1.A sharper bound for the uniform-sampling based k-means algorithm is proved,and a further proof indicates that this algorithm runs in polylogarithmic time given mild assumptions on datasets.2.The classic k-means++algorithm has been extended to weighted k-means problem and proofs on the clustering quality are given.3.A sharper bound for weighted kernel k-means based spectral clustering algorithm has been proved.4.MATLAB implementations of uniform sampling,K-MC2,Double-K-MC2 and their corresponding kernel versions on the k-means problem are given and experiments validate the efficiency and effectiveness of these algorithms.5.MATLAB implementations of weighted kernel k-means based spectral clustering and its uniform sampling versions are given and experiments illustrate the efficiency and the reasonable clustering quality of this sampling version.Our work not only proves the clustering quality of the uniform-sampling based clustering algorithm but also validates the quality-efficiency balance of this algorithm with experiments.Compared with seeding algorithms,this algorithm boosts the clustering quality without spending too much running time.Compared with original no-sampling clustering algorithms,this algorithm speeds up the original algorithms by an order of magnitude while loses only a little clustering quality.In the future,the algorithm can be applied widely on big data analysis,e.g.de-duplication of databases,community detection,and text mining on big data,etc.

Keywords/Search Tags:

k-means, sampling, theoretical guarantee, spectral clustering, Nystr?m method

Related items

1	Research On Spectral Clustering Algorithm Based On Weighted Ensemble Nystr?m Sampling
2	Large Scale Spectral Clustering Based On Nystr(?)m Extension
3	Research On Semi-supervised Spectral Clustering Algorithm Based On The Optimal Projection
4	Research On Improved Spectral Clustering Algorithm In The Application Of Image Segmentation
5	Segmentation Of Image-based Spectral Clustering Method
6	A Study Of Nystr(?)m Extensions Based On Cyclic Matrix Projections
7	The Application And Research Of Image Segmentation Based On Spectral Method
8	Research On Preference Extraction Based On Nystr?m Method
9	Spectral Clustering And Dictionary Learning Based Image Segmentation
10	Research Of SAR Image Segmentation Based On Improved Spectral Clustering Algorithm