Font Size: a A A

Parallel Clustering Algorithm Based On MapReduce

Posted on:2012-11-13Degree:MasterType:Thesis
Country:ChinaCandidate:C WenFull Text:PDF
GTID:2178330332976267Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The scale of Internet data is massive and is growing rapidly. Extracting the valuable information from the vast amounts of data has been a hot spot in the research of information technique. Clustering, being a method of unsupervised learning, is a common technique for statistical data analysis used in many fields, including data mining, machine learning, pattern recognition and image analysis. The traditional serial type of clustering algorithms has two problems, and is not capable enough to satisfy the actual application needs. The first problem is the speed of clustering is not high enough ("efficiency"). And the second problem is that, when processing with big data, the traditional clustering even can't work, because of the memory capacity constraints ("effective").MapReduce is the currently popular distributed computing framework, which is proposed by Google. This thesis researched two clustering algorithms by MapReduce: parallel spectral clustering, and parallel AP clustering. This thesis implemented these two algorithms on Hadoop cluster which was composed of 10 machines.Parallel spectral clustering is to segment the similarity matrix computation and sparse computation by data point index; When calculating the eigenvectors, I put the Laplacian matrix on the Hadoop distributed file system(HDFS), and launch distributed Lanczos operations to get the eigenvectors of Laplacian matrix; Finally, make parallel K-means clustering on eigenvector's transposed matrix to get the final clustering results. Based on each step of the algorithm parallel strategy, the whole algorithm obtains nearly linear speedup.Parallel AP clustering is first to store responsibility matrix and availability matrix in HBase, and segment the two matrixes computation by row index in each iteration, make these two matrixes computation distributed on multiple machines. The strategy makes algorithm computation linear speedup with the increase of machines.Through an empirical study on the Corel image dataset, I analyzed and compared their performances and clustering effects on MapReduce framework. Therefore, this thesis can provide solution ideas to the above mentioned clustering problems ("efficiency" and "effective").
Keywords/Search Tags:Distributed computing, Parallel clustering, AP clustering, Affinity Propagation, Spectral clustering, Hadoop, MapReduce, K-means clustering, HDFS, HBase
PDF Full Text Request
Related items