Parallel Spectral Clustering Algorithm Based On Hadoop

Posted on: 2013-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Li
Full Text: PDF
GTID: 2248330395975443
Subject: Computer technology
Abstract/Summary:
Spectral clustering and cloud computing are both new branches of computer-related disciplines. Spectral clustering takes spectral graph theory and matrix computation as its theoretical basis. It overcomes some shortcomings of traditional clustering algorithms and converges to a globally optimal solution of its relaxed objective, so it has attracted wide attention. The method clusters data by using the eigenvectors of the data similarity matrix, which gives it clear advantages over other kinds of clustering: it rests on spectral graph theory and therefore has a solid theoretical foundation, and the theory is simple and easy to implement. It works very well on small data sets. For large-scale data sets, however, the amount and complexity of computation grow geometrically and the storage requirement becomes unmanageable, which restricts the use of spectral clustering in practical applications. This raises the question of how to apply spectral clustering to large-scale data sets. With the development of cloud computing technologies, Hadoop, a mature open-source framework, has come into view. Combined with the Hadoop cloud computing framework, spectral clustering gains a way to handle large-scale data.
Hadoop is an open, mature, and widely used computing framework. Its two core subsystems are the distributed storage system HDFS and the parallel computing framework MapReduce, and HBase is a distributed database built on top of HDFS. With the help of Hadoop, spectral clustering can move from serial processing on a single machine to parallel processing on a cluster.
The parallelization work analyzes the serial implementation of the spectral clustering algorithm, extracts three comparatively time-consuming parts that can be parallelized, and parallelizes each of them separately, yielding a parallel spectral clustering algorithm based on Hadoop. This thesis first introduces the research background and significance of parallel spectral clustering, then describes the Hadoop cloud computing framework in detail and introduces spectral clustering and related work, and then explains the parallelization of spectral clustering and its steps. Finally, experiments are carried out and their results are summarized.
Keywords/Search Tags: spectral clustering, cloud computing, Hadoop, parallelization
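
As a minimal illustration of the kind of step such a parallelization targets, the sketch below builds a similarity matrix in a map/reduce style. The abstract does not name the three parallelized parts, so choosing similarity-matrix construction is an assumption. The function names, the Gaussian kernel, and the `sigma` parameter are likewise hypothetical, and the code simulates map, shuffle, and reduce serially in plain Python rather than calling the Hadoop API.

```python
import math
from collections import defaultdict


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def map_similarity_row(i, points, sigma=1.0):
    """Map step: emit (row, (col, similarity)) pairs for row i."""
    for j, other in enumerate(points):
        if i != j:  # leave the diagonal out of the similarity graph
            yield i, (j, gaussian_similarity(points[i], other, sigma))


def reduce_row(i, entries):
    """Reduce step: assemble one row and its degree (the row sum)."""
    row = dict(entries)
    return i, row, sum(row.values())


def similarity_matrix(points, sigma=1.0):
    """Run map, shuffle (group by row key), and reduce one after another."""
    grouped = defaultdict(list)
    for i in range(len(points)):
        for key, value in map_similarity_row(i, points, sigma):
            grouped[key].append(value)

    rows, degrees = {}, {}
    for key in sorted(grouped):
        _, row, degree = reduce_row(key, grouped[key])
        rows[key] = row
        degrees[key] = degree
    return rows, degrees


# Two tight pairs of points that lie far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
rows, degrees = similarity_matrix(points, sigma=1.0)
```

Each call to `map_similarity_row` depends only on the row index and the input points, so in a real cluster the rows could be computed on different nodes. The degrees returned by the reduce step are the row sums that a later normalization of the similarity matrix would need.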