Research On Parallel Clustering Algorithm Based On Map-Reduce

Posted on:2013-10-01

Degree:Master

Type:Thesis

Country:China

Candidate:C S Yu

Full Text:PDF

GTID:2248330395455463

Subject:Computer system architecture

Abstract/Summary:


With the rapid development of the information age, data has become diverse, massive, heterogeneous, and dynamically changing. Website operators often face an awkward situation: they are "rich in data but lacking in knowledge". People urgently need powerful data analysis tools to extract useful knowledge from complex, massive data sets and to discover the relationships and rules within them, in order to support decision-making and research and to yield valuable information. Clustering, a method of unsupervised learning, is a common technique for statistical data analysis used in many fields, including data mining, machine learning, pattern recognition, and image analysis.

Map-Reduce is a currently popular distributed computing framework proposed by Google. It separates the logic of a problem from the complex underlying implementation details and is aimed mainly at massive data processing. Compared with traditional parallel computing models, Map-Reduce takes care of the details of task scheduling, partitioning the input data, handling machine failures, and so on, and therefore greatly simplifies program design.

This thesis studies two clustering algorithms in depth, k-means clustering and canopy-k-means clustering, and designs parallel versions of both based on Map-Reduce. Both algorithms were implemented on a Hadoop cluster composed of 4 machines. The experimental results show that canopy-k-means based on Map-Reduce achieves higher accuracy and better convergence than k-means based on Map-Reduce. Both algorithms show good speedup and scalability.
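The abstract does not give the thesis's actual design, so the following is only a minimal sketch of how one k-means iteration is commonly expressed in Map-Reduce form. It runs in a single Python process and does not use Hadoop. The function names (kmeans_map, kmeans_reduce, kmeans_iteration) and the sample data are hypothetical and chosen for illustration; a real Hadoop job would typically implement the same two steps as Mapper and Reducer classes.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step (hypothetical name): emit (nearest centroid index, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step (hypothetical name): average the points of one cluster."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """Simulate one Map-Reduce round of k-means in a single process."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


# Illustrative data only, not taken from the thesis.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
```

In this formulation the map step is independent for every point, which is what allows the work to be spread across machines. The driver repeats the round until the centroids stop moving. A canopy-based variant would, under the usual description of that technique, run a cheap pre-clustering pass first to choose the initial centroids.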

Keywords/Search Tags:

Hadoop, Map-Reduce, k-means, clustering, distributed computing


Related items

1	Reach On Map-reduce Application Based On Hadoop
2	Reach On Map-Reduce Application Based On Hadoop
3	A Research And Implementation With Improved K-Means Clustering Algorithm To Image Retrieval System Based On Hadoop Platform
4	Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform
5	Construct High Performance Text Clustering Systems Based On Map-Reduce
6	Research On Parallelization Of Text Clustering Based On Hadoop
7	Parallel Clustering Algorithm Based On MapReduce
8	Research On Cloud Computing Search Engine Design And Parallelization K-means Clustering Algorithms For Big Data
9	Research And Implementation Of Distributed Clustering Algorithm Based On Hadoop Platform
10	Oneof Text Clustering Algorithm Based On Big Data