Font Size: a A A

Research On Spectral Clustering Algorithm Under Big Data

Posted on:2022-11-27Degree:MasterType:Thesis
Country:ChinaCandidate:G P YangFull Text:PDF
GTID:2518306782952589Subject:Automation Technology
Abstract/Summary:PDF Full Text Request
Clustering analysis is an important research subject in the field of data mining.In the past few decades,a large number of clustering algorithms have proposed,among which spectral clustering is widely used due to its excellent performance in nonlinearly separable data.In addition,with the continuous development of the Internet,more and more data is generated,forming big data.Therefore,how to apply spectral clustering algorithm to big data and mine useful information has become a very important research topic.However,spectral clustering algorithms have serious scalability problems.Spectral clustering requires extremely high time complexity and extremely high space complexity to calculate and store the corresponding Laplacian graph and perform eigendecomposition on the graph.Although a large number of big data spectral clustering algorithms have greatly reduced the time and space complexity of spectral clustering in the past two decades,in big data problems,they still require a lot of time and memory to complete the clustering,especially when the memory is not enough,the algorithm can not run.This shows that in the big data spectral clustering problem,the computational bottleneck is still an important and unsolved problem.To this end,this paper proposes a vector quantization-based framework for big data spectral custering(VQSC),a simple and efficient lightweight big data spectral clustering framework,which can quickly solve the big data spectral clustering problem in the case of limited resources.The main research contents of this paper can be briefly summarized as follows:(1)This paper proposes an approximate k-means-based fast vector quantization technique.This technique obtains some prototypes by applying k-means to some samples that randomly sampled from the original dataset,and then uses these prototypes to represent the original dataset to complete the spectral clustering operation.(2)This paper proposes an information-enhanced prototype similarity calculation technique.Different from directly calculating the distances between prototypes and using these distances to further generate the similaritites between prototypes,this paper uses the relationships between prototypes and samples to calculate the similarities between prototypes,which makes the constructed prototype Laplacian graph contain the information of the samples.(3)This paper proposes a batch fast clustering technique.Due to the nature of vector quantization,each data only needs to compute the distances to all prototypes and assign them directly to the class of the nearest prototype.Different from the present big data spectral clustering algorithms,VQSC only needs to save the prototypes in the memory,and then read the data in batches to complete the clustering.(4)This paper provides a perturbation analysis for VQSC to analyze the perturbation error of the prototype approximate eigenvectors and the target eigenvectors,where the prototype approximate eigenvectors and the target eigenvector represent the eigenvectors of the prototype Laplacian grpah and the eigenvectors of the Laplacian grpah constructed from the original dataset directly.The experimental results on eight datasets of different sizes show that the VQSC proposed in this paper can quickly complete the task of big data spectral clustering in the case of small memory,which well complements and improves the current research on big data spectral clustering algorithms.In addition,this paper conducts an experimental analysis on the parameters involved in VQSC,which proves that the parameters involved in VQSC have high stability and flexibility,and can be well adapted to big data spectral clustering tasks.
Keywords/Search Tags:Big Data, Scalability, Spectral Clustering
PDF Full Text Request
Related items