Research And Implementation Of Distributed Clustering Algorithm Based On Hadoop Platform

Posted on:2014-09-11

Degree:Master

Type:Thesis

Country:China

Candidate:J Liu

Full Text:PDF

GTID:2348330473451166

Subject:Software engineering

Abstract/Summary:

With the wide adoption of the Internet, the amount of data that every industry must process and analyze is growing rapidly and has reached massive scale. Traditional clustering algorithms often cannot handle large network data, because a single physical machine runs out of memory or processes the data too slowly. Distributed computing offers an effective way around these limits, but three problems remain. There is no efficient distributed algorithm for clustering structured networks. The initial centers of the distributed k-means clustering algorithm are chosen subjectively, which makes the clustering result unstable. Improved distributed k-means algorithms select their initial centers through a complicated process with high time and space complexity.

To address these problems, this thesis reviews the state of clustering research in China and abroad. It studies the basic principles, advantages, and disadvantages of the SCAN algorithm, the Clique algorithm, and the distributed k-means algorithm, as well as the characteristics and operating mechanisms of the distributed file system and the distributed computing framework on the Hadoop platform. The thesis proposes two distributed clustering algorithms. The first is a structured distributed clustering algorithm that builds on the SCAN algorithm, uses MRC theory to bound the number of MapReduce rounds, and uses a Map merging technique to control network traffic. The second is a density-based algorithm that combines the distributed k-means algorithm with the Clique algorithm: a distributed Clique algorithm automatically and quickly determines the number of clusters and selects the global initial cluster centers, and it copes well with data sets that contain noise.

The experiments run on a Hadoop cluster and use simulated structured and spatial networks. The results show that the structured distributed clustering algorithm has good performance, availability, and scalability, and that the density-based distributed algorithm outperforms the distributed k-means algorithm in both clustering quality and efficiency.
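
The density-based seeding idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation, which runs on MapReduce and is not reproduced here. The function names `dense_cell_seeds` and `kmeans` and the parameters `cell_size` and `min_points` are assumptions made for illustration. The sketch bins points into a grid, keeps cells that hold at least `min_points` points, merges face-adjacent dense cells into groups in the manner of Clique, and uses the mean of each group as an initial k-means center. The number of groups gives the number of clusters, and points in sparse cells (likely noise) do not influence the seeds.

```python
def face_neighbors(cell):
    """Yield the grid cells that share a face with `cell`."""
    for axis in range(len(cell)):
        for step in (-1, 1):
            yield cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]


def dense_cell_seeds(points, cell_size, min_points):
    """Return one seed (mean point) per connected group of dense grid cells.

    Hypothetical helper: `cell_size` and `min_points` are illustrative
    parameters, not values taken from the thesis.
    """
    # Bin every point into the grid cell that contains it.
    cells = {}
    for p in points:
        key = tuple(int(c // cell_size) for c in p)
        cells.setdefault(key, []).append(p)

    # A cell is dense when it holds at least `min_points` points.
    dense = {key for key, pts in cells.items() if len(pts) >= min_points}

    seeds, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Depth-first search over face-adjacent dense cells.
        stack, members = [start], []
        seen.add(start)
        while stack:
            cell = stack.pop()
            members.extend(cells[cell])
            for nb in face_neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        # The seed is the mean of all points in the connected group.
        seeds.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return seeds


def kmeans(points, centers, iterations=10):
    """Plain Lloyd iterations starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            # Assign the point to its nearest center (squared Euclidean).
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        # Recompute each center; keep the old one if its group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


if __name__ == "__main__":
    data = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6), (0.7, 0.8),
            (10.1, 10.2), (10.3, 10.4), (10.5, 10.6),
            (5.5, 5.5)]  # the last point is isolated noise
    seeds = dense_cell_seeds(data, cell_size=1.0, min_points=3)
    print(len(seeds), "clusters found; final centers:", kmeans(data, seeds))
```

In this sketch the noise point is ignored during seeding, but plain k-means still assigns it to the nearest center afterwards. A distributed version would presumably count cell occupancy in the map phase and merge dense cells in the reduce phase, though the abstract does not spell out that design.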

Keywords/Search Tags:

Hadoop, clustering, SCAN, k-means, Clique

Related items

1	The Research On The Improvement And Parallelization Of CLIQUE Algorithm In Hadoop Environment
2	Research On Machine Learning Clustering Algorithms In The Hadoop Development Environment
3	Research And Application Of Clustering In Telecom Customer Differentiated Reminder Based On Hadoop
4	Research And Implementation Of Clustering Algorithm Based On Hadoop Platform
5	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
6	Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform
7	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
8	A Research And Implementation With Improved K-Means Clustering Algorithm To Image Retrieval System Based On Hadoop Platform
9	Research On Parallel Clustering Algorithm On Hadoop Platform
10	Clustering Analysis Based On Hadoop