Parallel Clustering Algorithm Based On MapReduce

Posted on:2012-11-13

Degree:Master

Type:Thesis

Country:China

Candidate:C Wen

Full Text:PDF

GTID:2178330332976267

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

The scale of Internet data is massive and is growing rapidly. Extracting the valuable information from the vast amounts of data has been a hot spot in the research of information technique. Clustering, being a method of unsupervised learning, is a common technique for statistical data analysis used in many fields, including data mining, machine learning, pattern recognition and image analysis. The traditional serial type of clustering algorithms has two problems, and is not capable enough to satisfy the actual application needs. The first problem is the speed of clustering is not high enough ("efficiency"). And the second problem is that, when processing with big data, the traditional clustering even can't work, because of the memory capacity constraints ("effective").MapReduce is the currently popular distributed computing framework, which is proposed by Google. This thesis researched two clustering algorithms by MapReduce: parallel spectral clustering, and parallel AP clustering. This thesis implemented these two algorithms on Hadoop cluster which was composed of 10 machines.Parallel spectral clustering is to segment the similarity matrix computation and sparse computation by data point index; When calculating the eigenvectors, I put the Laplacian matrix on the Hadoop distributed file system(HDFS), and launch distributed Lanczos operations to get the eigenvectors of Laplacian matrix; Finally, make parallel K-means clustering on eigenvector's transposed matrix to get the final clustering results. Based on each step of the algorithm parallel strategy, the whole algorithm obtains nearly linear speedup.Parallel AP clustering is first to store responsibility matrix and availability matrix in HBase, and segment the two matrixes computation by row index in each iteration, make these two matrixes computation distributed on multiple machines. The strategy makes algorithm computation linear speedup with the increase of machines.Through an empirical study on the Corel image dataset, I analyzed and compared their performances and clustering effects on MapReduce framework. Therefore, this thesis can provide solution ideas to the above mentioned clustering problems ("efficiency" and "effective").

Keywords/Search Tags:

Distributed computing, Parallel clustering, AP clustering, Affinity Propagation, Spectral clustering, Hadoop, MapReduce, K-means clustering, HDFS, HBase

PDF Full Text Request

Related items

1	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
2	Spectral Clustering Method Based On Affinity Propagation For Vessel Trajectory Clustering
3	Research And Application Of Clustering Parallel Strategy For Affinity Propagation
4	Research And Application Of Parallelizing Affinity Propagation Clustering Based On Distributed Computing
5	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
6	Research And Implementation Of Mapreduce-based Graph Clustering Algorithm
7	Research Of Spectral Clustering
8	Accurate Marketing Of Credit Card Customers Based On AP Clustering Algorithm
9	Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform
10	The Clustering Algorithm Based On Hadoop Parallel Analysis And Applied Research