Research On Parallel Clustering Algorithm Based On Map-Reduce

Posted on: 2013-10-01  Degree: Master  Type: Thesis
Country: China  Candidate: C S Yu  Full Text: PDF
GTID: 2248330395455463  Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the information age, data has become diverse, massive, heterogeneous and dynamically changing. Website operators often face the awkward situation of being "rich in data but lacking in knowledge". People urgently need powerful data analysis tools to extract useful knowledge from complex, massive data, and to discover the relationships and rules within it that support decision making and research and yield valuable information. Clustering, a method of unsupervised learning, is a common technique for statistical data analysis used in many fields, including data mining, machine learning, pattern recognition and image analysis.

Map-Reduce is a currently popular distributed computing framework proposed by Google. It separates the logic of a problem from the complex underlying implementation details and is aimed mainly at processing massive data. Compared with traditional parallel computing models, Map-Reduce takes care of the details of task scheduling, partitioning the input data, handling machine failures, and so on, and therefore greatly simplifies program design.

This thesis studies two clustering algorithms in depth, k-means clustering and canopy-k-means clustering, and designs parallel versions of both based on Map-Reduce. Both algorithms were implemented on a Hadoop cluster of 4 machines. The experimental results show that canopy-k-means based on Map-Reduce achieves higher accuracy and better convergence than k-means based on Map-Reduce, and that both algorithms have good speedup and scalability.
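
The abstract does not give the details of the thesis's parallel design, so the following is only a minimal sketch of how a single k-means iteration is commonly expressed as a map step and a reduce step. It runs in one process and simulates the shuffle with a dictionary; it does not use Hadoop. All function names (`kmeans_map`, `kmeans_combine`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical and chosen for illustration, and points are assumed to be tuples of floats.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_combine(a, b):
    """Add two partial (vector sum, count) pairs component-wise."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans_reduce(partials):
    """Reduce step: merge all partial sums for one centroid and return the mean."""
    total = partials[0]
    for partial in partials[1:]:
        total = kmeans_combine(total, partial)
    vector_sum, count = total
    return tuple(x / count for x in vector_sum)


def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the framework's shuffle phase
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    updated = list(centroids)  # a centroid with no assigned points is kept as-is
    for key, partials in groups.items():
        updated[key] = kmeans_reduce(partials)
    return updated
```

A driver would call `kmeans_iteration` repeatedly, feeding the returned centroids back in, until they stop changing or an iteration limit is reached. In a canopy-k-means variant, a cheap canopy pass is typically run first to choose the initial centroids; that step is not shown here.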
Keywords/Search Tags: Hadoop, Map-Reduce, k-means, clustering, distributed computing