With the rapid development of the information age, data has become diverse, massive, heterogeneous, and dynamically changing. Website operators often face an awkward situation: they are rich in data but poor in knowledge. People urgently need powerful data analysis tools that can extract useful knowledge from complex, massive data and discover the relationships and rules within it, in order to support decision making and research and to yield valuable information. Clustering, an unsupervised learning method, is a common technique for statistical data analysis used in many fields, including data mining, machine learning, pattern recognition, and image analysis.

Map-Reduce is a currently popular distributed computing framework proposed by Google. It separates the logic of a problem from the complex underlying implementation details and is aimed mainly at processing massive data. Compared with traditional parallel computing models, Map-Reduce takes care of task scheduling, partitioning the input data, handling machine failures, and so on, and therefore greatly simplifies program design.

This thesis studies two clustering algorithms in depth, k-means clustering and canopy-k-means clustering, and designs parallel versions of both based on Map-Reduce. Both algorithms were implemented on a Hadoop cluster of four machines. The experimental results show that canopy-k-means based on Map-Reduce achieves higher accuracy and better convergence than k-means based on Map-Reduce, and that both algorithms have good speedup and scalability.
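
To illustrate how k-means can be expressed in the Map-Reduce model, here is a minimal single-process sketch of one iteration in Python. It is not the thesis's Hadoop implementation: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, the shuffle phase is simulated with an in-memory dictionary, and the canopy step that would supply the initial centers in canopy-k-means is assumed to have run already.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point, centroids):
    """Map step (illustrative): emit (index of nearest centroid, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step (illustrative): average the points that share one key."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        for d in range(dim):
            total[d] += point[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """Run one map -> shuffle -> reduce round and return updated centroids."""
    groups = defaultdict(list)  # stands in for the framework's shuffle phase
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    updated = list(centroids)  # a centroid with no assigned points is kept as-is
    for key, values in groups.items():
        updated[key] = kmeans_reduce(values)
    return updated


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    centroids = [(0, 0), (10, 10)]
    print(kmeans_iteration(points, centroids))
```

In a real cluster the map calls would run in parallel over partitions of the input and the framework would group the emitted pairs by key; the driver would repeat the iteration until the centroids stop changing.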