Research On Parallel Clustering Algorithm On Hadoop Platform

Posted on:2019-04-02

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Guo

GTID:2428330575973659

Subject:Software engineering

Abstract/Summary:

Clustering algorithms play a vital role in data mining. They are used to measure and analyze the similarity between objects in a data source, and they can serve as a preprocessing step for other data mining algorithms. The Hadoop platform can distribute an entire computing task across multiple computers in a resource pool and can process massive amounts of data efficiently. Combining a clustering algorithm with Hadoop therefore helps to provide or improve the original algorithm's ability to handle massive data. First, because the traditional K-means algorithm chooses its initial cluster centers at random, which easily leads to poor clustering results, this thesis proposes an optimized K-means algorithm on the Hadoop platform. The optimization focuses on the choice of initial cluster centers, and its basic idea follows a "maximum nearest-distance" principle. Based on the Mahout data model, one object is chosen as the first initial cluster center. The sample farthest from the first center is set as the second initial cluster center. The third initial cluster center is the sample whose distance to its nearest already-chosen center is the largest among all samples. Repeating this step yields a set of K initial cluster centers. The parallel design and implementation of the algorithm are analyzed in terms of the MapReduce programming model. Second, because the optimized K-means algorithm cannot accurately estimate the number of cluster centers K, this thesis proposes CFSFDPH, a Hadoop-based version of the clustering by fast search and find of density peaks (CFSFDP) algorithm. CFSFDPH follows the principle of "dividing the whole into parts". The data set is first divided into several groups, and CFSFDP is then executed independently on each group under the MapReduce programming model, producing a set of local clustering results in which the class attribute of every cluster center is marked. A Reduce function then applies CFSFDP clustering to the set of the most representative local cluster centers from each local result to obtain the final cluster centers for the whole data set. Finally, a Map function updates the class attribute values of all points. A comparison of the three methods shows that when the data volume is small, the optimized K-means algorithm performs better than the other two, and when the data volume is larger, the CFSFDPH algorithm performs best.
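The initial-center selection described in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the thesis's Hadoop/Mahout implementation. The function names `euclidean` and `maxmin_initial_centers`, the `first_index` parameter (standing in for however the first center is actually chosen), and the use of Euclidean distance are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_initial_centers(points, k, first_index=0):
    """Pick k initial centers: each new center is the point whose distance
    to its nearest already-chosen center is largest.

    first_index is a hypothetical stand-in for the rule that selects the
    first center; the abstract does not specify that rule.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [points[first_index]]
    # nearest[i] holds the distance from points[i] to its closest chosen center.
    nearest = [euclidean(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point farthest from its nearest center.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Refresh the nearest-center distances with the newly added center.
        nearest = [min(d, euclidean(p, points[idx]))
                   for p, d in zip(points, nearest)]

    return centers


pts = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
print(maxmin_initial_centers(pts, 3))  # [(0, 0), (10, 1), (5, 8)]
```

With only one center chosen, the nearest-center distance is simply the distance to that center, so the second pick is the farthest sample, which matches the second step in the abstract. In a MapReduce setting the distance refresh would presumably be spread across mappers, but that part is not sketched here.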

Keywords/Search Tags:

Hadoop, MapReduce, Mahout, Clustering Algorithm, K-means, CFSFDP

Related items

1	Research Of Clustering Algorithm Based On Mahout
2	K-Means Algorithm Design And Implementation Based On Hadoop And Mahout
3	Parallel Clustering Algorithm Based On MapReduce
4	The Optimization Of Parallelized K-means Based On Mahout
5	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
6	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
7	Research On Clustering Algorithm On Hadoop Platform
8	Oneof Text Clustering Algorithm Based On Big Data
9	Improved Indoor Positioning System Based On Clustering Algorithm And Feature Extraction
10	The Clustering Algorithm Based On Hadoop Parallel Analysis And Applied Research