The Key Research Of Clustering Algorithm Parallelization On The Platform Of Cloud Computing

Posted on:2016-06-24

Degree:Master

Type:Thesis

Country:China

Candidate:S C Wei

Full Text:PDF

GTID:2298330467980940

Subject:Computer technology

Abstract/Summary:

With the development of the Internet, the trend toward distributed data mining and applications based on cloud computing has become irreversible. In this area, the distributed platform Hadoop is widely used in industry and has become the de facto standard. Cluster analysis, an important data mining method, is widely used in industrial, commercial, scientific research, and other fields. The k-means clustering algorithm in particular is widely used because it is fast, simple, and efficient. However, the algorithm has several major defects, chief among them that the clustering result is sensitive to the selection of the initial cluster centers. Researchers abroad proposed the k-means++ initialization algorithm to address this defect, and to a great extent it overcomes the shortcoming. But because of its "internal order", in which each new center depends on the centers already chosen, this initialization algorithm is difficult to parallelize and to apply to high-dimensional data sets.

To break through this bottleneck, this thesis takes distributed technology based on cloud computing as its starting point and studies the MapReduce parallel computing model in depth. Focusing on the serial k-means++ initialization algorithm, it proposes a parallel initialization algorithm based on the principle of "over-sampling", called pk-means++. Parallel implementations of k-means are built on both the pk-means++ initialization algorithm and the random initialization algorithm, and the two implementations are evaluated experimentally on massive data sets. The experiments show that, compared with random initialization, k-means based on the pk-means++ initialization algorithm greatly improves the overall performance of the k-means algorithm, improves the convergence of the iterative computation, and yields stable clustering results.

Building on further study of text clustering technology, a text clustering model based on the pk-means++ initialization algorithm is designed in order to test the execution time and scalability of the parallel initialization algorithm, to verify the superiority of the new algorithm, and to continue optimizing it. The experiments show that the text clustering model based on the pk-means++ initialization algorithm achieves a very high speedup ratio and high cluster scalability, overcoming the limitation of the k-means algorithm in handling massive high-dimensional data.
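
The contrast the abstract draws between serial k-means++ seeding and an over-sampling style of seeding can be illustrated with a small sketch. The abstract does not give the details of pk-means++, so the code below is not the thesis's algorithm: it is a single-process illustration of the general over-sampling idea, and every function name and parameter (`kmeanspp_seed`, `oversample_seed`, `rounds`, `factor`) is hypothetical. It assumes points are tuples of numbers and uses only the Python standard library.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dist(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)


def kmeanspp_seed(points, k, rng, weights=None):
    """Serial k-means++ seeding (optionally weighted).

    Each new center is drawn with probability proportional to
    weight * D^2, where D is the distance to the centers chosen so far.
    This dependence on earlier picks is the sequential "internal order".
    """
    w = weights or [1] * len(points)
    centers = [rng.choices(points, weights=w, k=1)[0]]
    while len(centers) < k:
        scores = [wi * nearest_sq_dist(p, centers) for p, wi in zip(points, w)]
        if sum(scores) > 0:
            centers.append(rng.choices(points, weights=scores, k=1)[0])
        else:
            # Every remaining point coincides with a center: fall back to uniform.
            centers.append(rng.choice(points))
    return centers


def oversample_seed(points, k, rng, rounds=5, factor=2.0):
    """Over-sampling seeding sketch (assumed form, not the thesis's pk-means++).

    In each round every point is sampled independently, so the distance
    pass could be spread over mappers and the total summed in a reducer.
    """
    candidates = [rng.choice(points)]
    for _ in range(rounds):
        dists = [nearest_sq_dist(p, candidates) for p in points]  # map-like step
        total = sum(dists)                                        # reduce-like step
        if total == 0:
            break
        for p, d in zip(points, dists):
            if rng.random() < min(1.0, factor * k * d / total):
                candidates.append(p)
    # Weight each candidate by how many points it is closest to.
    counts = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        counts[j] += 1
    # Reduce the small candidate set to k centers with weighted serial seeding.
    return kmeanspp_seed(candidates, k, rng, weights=counts)


# Toy data: three Gaussian blobs in 2-D (illustrative only).
data_rng = random.Random(42)
points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in [(0, 0), (10, 10), (20, 0)]
    for _ in range(50)
]
serial_centers = kmeanspp_seed(points, 3, random.Random(1))
oversampled_centers = oversample_seed(points, 3, random.Random(1))
print(serial_centers)
print(oversampled_centers)
```

The serial routine needs k dependent passes over the full data set, whereas the over-sampling routine needs only a fixed number of rounds, each of which is an independent per-point computation followed by a sum. That is the property that makes this style of seeding a natural fit for a MapReduce job, under the assumptions stated above.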

Keywords/Search Tags:

Cloud Computing, MapReduce, clustering, k-means, pk-means++

Related items

1	Research Of K-means Clustering Algorithm Based On MapReduce
2	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
3	Research On Parallelization Of K-Means Clustering Algorithm Based On MapReduce
4	Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform
5	Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform
6	Research On A Web Log Mining Technology With Improved K-means Algorithm
7	Parallel Clustering Algorithm Based On MapReduce
8	The Research On Parallel Computing Technology In Precise Agricultural Climate Division
9	Research Of Clustering Algorithm Based On Cloud Computing Platform
10	Improved K-means Clustering Algorithm Based On MapReduce Framework