Research On Parallel Clustering Algorithm On Hadoop Platform

Posted on: 2019-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Guo
Full Text: PDF
GTID: 2428330575973659
Subject: Software engineering
Abstract/Summary:
Clustering algorithms play a vital role in data mining. They are often used to measure and analyze the similarity between objects in a data source, and they can serve as a preprocessing step for other data mining algorithms. The Hadoop platform distributes a computing task across multiple computers in a resource pool and can process massive amounts of data efficiently. Combining a clustering algorithm with Hadoop therefore helps give the original algorithm the ability to handle massive data, or improves that ability. First, the traditional K-means algorithm chooses its initial cluster centers at random, which easily leads to poor clustering results. To address this shortcoming, this thesis proposes an optimized K-means algorithm on the Hadoop platform. The optimization focuses on the choice of initial cluster centers, and its basic idea is to follow a "maximum nearest distance" principle. Based on the Mahout data model, one object is chosen as the first initial cluster center. The sample farthest from the first center becomes the second initial cluster center. The third initial cluster center is the sample whose distance to its nearest already-chosen center is the largest. Repeating this step yields a set of K initial cluster centers. The parallel design and implementation of the algorithm are analyzed with the MapReduce programming model. Second, to address the problem of accurately estimating the number of cluster centers K in the optimized K-means algorithm, this thesis proposes a Hadoop-based version of clustering by fast search and find of density peaks (CFSFDPH). The CFSFDPH algorithm follows a "divide the whole into parts" principle. It first splits the data set into several groups and then, using the MapReduce programming model, runs the CFSFDP algorithm independently on each group. This produces a set of local clustering results and marks the class attribute of every cluster center. A Reduce function then applies CFSFDP clustering to the set of the most representative local cluster centers from each result set in order to obtain the final cluster centers of the whole data set. Finally, a Map function updates the class attribute values of all points. A comparison of the three methods shows that the optimized K-means algorithm performs better than the other two when the data volume is small, while the CFSFDPH algorithm works best when the data volume is larger.
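The initial-center selection described in the abstract can be illustrated with a small single-machine sketch: each new center is the sample whose distance to its nearest already-chosen center is largest. This is not the thesis's Hadoop/Mahout implementation. The function name `max_min_init`, the `first_index` parameter, and the use of Euclidean distance are assumptions made for illustration, since the abstract does not say how the first center is picked or which distance measure is used.

```python
import math


def max_min_init(points, k, first_index=0):
    """Pick k initial centers with a max-min distance rule.

    Hypothetical helper: first_index stands in for however the first
    center is chosen, and Euclidean distance is an assumption.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [points[first_index]]
    # nearest[i] is the distance from points[i] to its closest chosen center.
    nearest = [math.dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point farthest from all chosen centers,
        # i.e. the one with the largest nearest-center distance.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Only the new center can shrink a point's nearest-center distance.
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
print(max_min_init(points, 3))  # [(0, 0), (10, 1), (5, 8)]
```

With `k = 2` the rule reduces to picking the sample farthest from the first center, which matches the second step in the abstract. In a MapReduce setting the per-point distance update would be the natural part to distribute, but how the thesis splits the work between Map and Reduce is not stated here.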
Keywords/Search Tags: Hadoop, MapReduce, Mahout, Clustering Algorithm, K-means, CFSFDP