Research And Implementation Of Distributed Clustering Algorithm Based On Hadoop Platform

Posted on: 2014-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
GTID: 2348330473451166
Subject: Software engineering
Abstract/Summary:
With the wide adoption of the Internet, the amount of data that every industry must process and analyze is growing rapidly and has reached massive scale. Because a single physical machine runs out of memory or processes data too slowly, traditional clustering algorithms often cannot meet the needs of large-scale network data. Distributed computing offers an effective way to address these problems. However, there is no efficient distributed algorithm for clustering structured networks. In the distributed k-means clustering algorithm, the subjective choice of initial centers makes the clustering result unstable. In improved distributed k-means algorithms, the initial center selection process is complicated and has high time and space complexity.

To address these problems, this thesis reviews the state of clustering research in China and abroad; studies the basic principles, advantages, and disadvantages of the SCAN algorithm, the Clique algorithm, and the distributed k-means algorithm; and examines the characteristics and operating mechanisms of the distributed file system and the distributed computing framework on the Hadoop platform. The thesis proposes two distributed clustering algorithms. The first is a structured distributed clustering algorithm that builds on the SCAN algorithm, uses MRC theory to bound the number of MapReduce rounds, and uses a Map merging technique to control network traffic. The second is a density-based algorithm that combines the distributed k-means algorithm with the Clique algorithm: a distributed Clique algorithm automatically and quickly determines the number of clusters and selects the global initial cluster centers, and the algorithm copes well with data sets that contain noise.

The experiments use structured networks and spatial networks generated by simulation and are run on a Hadoop cluster. The results show that the structured distributed clustering algorithm has good performance, availability, and scalability, and that the density-based distributed algorithm outperforms the distributed k-means algorithm in both clustering quality and efficiency.
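
The abstract does not say how the density step produces the cluster count and the initial centers. The sketch below is only an illustration of the general idea, not the thesis's algorithm. It makes three assumptions: a fixed-width grid in the style of Clique, a density threshold `min_points`, and one center per connected group of dense cells. All function and parameter names are hypothetical, and plain Python functions stand in for Hadoop map and reduce tasks.

```python
def map_cell_counts(points, cell_width):
    """Map step (one data split): count points and sum coordinates per grid cell.

    Aggregating inside the split before emitting keeps the output small,
    which is one way to limit the data sent over the network.
    """
    partial = {}
    for p in points:
        cell = tuple(int(x // cell_width) for x in p)
        count, total = partial.get(cell, (0, (0.0,) * len(p)))
        partial[cell] = (count + 1, tuple(t + x for t, x in zip(total, p)))
    return partial


def reduce_cell_counts(partials):
    """Reduce step: merge the per-split cell statistics into global ones."""
    merged = {}
    for partial in partials:
        for cell, (count, total) in partial.items():
            c0, t0 = merged.get(cell, (0, (0.0,) * len(total)))
            merged[cell] = (c0 + count, tuple(a + b for a, b in zip(t0, total)))
    return merged


def pick_initial_centers(merged, min_points):
    """Group face-adjacent dense cells and return one centroid per group.

    Cells with fewer than min_points points are treated as noise and ignored.
    The number of groups is the assumed number of clusters k.
    """
    dense = {cell for cell, (count, _) in merged.items() if count >= min_points}
    seen, centers = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            cell = stack.pop()
            group.append(cell)
            # Visit neighbours that differ by one step in exactly one dimension.
            for i in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:i] + (cell[i] + step,) + cell[i + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        n = sum(merged[c][0] for c in group)
        dim = len(start)
        centers.append(
            tuple(sum(merged[c][1][i] for c in group) / n for i in range(dim))
        )
    return centers


# Two hypothetical data splits; the point (9.9, 0.0) is isolated noise.
split_a = [(0.1, 0.2), (0.3, 0.1), (0.4, 0.4), (5.1, 5.2)]
split_b = [(5.3, 5.4), (5.2, 5.1), (9.9, 0.0)]
partials = [map_cell_counts(s, cell_width=1.0) for s in (split_a, split_b)]
centers = pick_initial_centers(reduce_cell_counts(partials), min_points=3)
print(len(centers), centers)
```

In this sketch the resulting centers would seed an ordinary distributed k-means run, so neither k nor the starting points are chosen by hand. The values of `cell_width` and `min_points` are placeholders and would need tuning for real data.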
Keywords/Search Tags:Hadoop, clustering, SCAN, k-means, Clique