
The Key Research Of Clustering Algorithm Parallelization On The Platform Of Cloud Computing

Posted on: 2016-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: S C Wei
GTID: 2298330467980940
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, the shift toward distributed data mining and applications based on cloud computing has become irreversible. In this area, the distributed platform Hadoop is widely used in industry and has become the de facto standard. Cluster analysis, an important data mining method, is widely used in industrial, commercial, scientific research, and other fields. The k-means clustering algorithm in particular is widely used because it is fast, simple, and efficient. The algorithm nevertheless has several major defects, chief among them that the clustering result is sensitive to the choice of the initial cluster centers. Some foreign scholars proposed the k-means++ initialization algorithm to address this defect, and to a great extent it overcomes the shortcoming. Because of its "internal order", however, this initialization algorithm is difficult to parallelize and to apply to high-dimensional data sets.

To break through this bottleneck, this thesis takes distributed technology based on cloud computing as its starting point and studies the MapReduce parallel computing model in depth. It then focuses on the serial "k-means++" initialization algorithm and proposes a parallel initialization algorithm based on the principle of "over-sampling", called "pk-means++". Parallel implementations of k-means were built on both the pk-means++ initialization algorithm and the random initialization algorithm. Finally, both implementations were run on massive data sets.
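
The "internal order" of serial k-means++ seeding can be illustrated with a minimal sketch. It is not taken from the thesis: the function names `sq_dist` and `kmeanspp_init` are hypothetical, and the code only shows the standard k-means++ rule, in which each new center is drawn with probability proportional to its squared distance from the centers already chosen. Every draw therefore has to wait for the previous one, which is what makes the procedure hard to parallelize.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """Serial k-means++ seeding (illustrative sketch, hypothetical helper).

    The first center is uniform at random. Each later center is drawn with
    probability proportional to its squared distance from the nearest center
    chosen so far, so step i cannot start before step i - 1 has finished.
    """
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center: fall back to a uniform draw.
            centers.append(rng.choice(points))
            continue
        new_center = rng.choices(points, weights=d2)[0]
        centers.append(new_center)
        # Sequential dependency: the weights change after every single draw.
        d2 = [min(old, sq_dist(p, new_center)) for old, p in zip(d2, points)]
    return centers
```

Calling `kmeanspp_init(points, 3, random.Random(0))` on a list of coordinate tuples returns three seed points taken from the input. The loop makes k passes over the whole data set, one after another.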
Experiments show that k-means based on the pk-means++ initialization algorithm, compared with the random initialization algorithm, greatly improves the overall performance of the k-means algorithm, improves the convergence of the iterative computation, and yields stable clustering results.

Building on further study of text clustering technology, a text clustering model based on the pk-means++ initialization algorithm was designed to test the execution time and scalability of the parallel initialization algorithm, to verify the superiority of the new algorithm, and to continue optimizing it. Experiments show that this text clustering model achieves a very high speedup ratio and high cluster scalability, and that the limitation of the k-means algorithm in dealing with massive high-dimensional data has been successfully overcome.
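
The abstract does not spell out how pk-means++ works, so the following is only a sketch of one well-known over-sampling scheme, in the style of the k-means|| initialization, and may differ from the thesis's algorithm. The names `nearest` and `oversample_init` and the parameters `ell` (expected samples per round) and `rounds` are assumptions made for illustration. The idea shown is that each round samples many candidates at once, with every point deciding independently, so the per-point work resembles a map step and the distance total resembles a reduce step.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Return (index, squared distance) of the center closest to p."""
    return min(((i, sq_dist(p, c)) for i, c in enumerate(centers)),
               key=lambda t: t[1])


def oversample_init(points, k, ell, rounds, rng):
    """Over-sampling seeding sketch (hypothetical helper); returns up to k centers."""
    candidates = [rng.choice(points)]
    for _ in range(rounds):
        # Map-like step: each point's distance is computed independently.
        d2 = [nearest(p, candidates)[1] for p in points]
        # Reduce-like step: a single global sum of the distances.
        phi = sum(d2)
        if phi == 0:
            break
        # Each point is kept independently, with about `ell` kept per round.
        candidates.extend(p for p, w in zip(points, d2)
                          if rng.random() < min(1.0, ell * w / phi))
    # Weight every candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        weights[nearest(p, candidates)[0]] += 1
    # The candidate set is small, so weighted k-means++ can reduce it serially.
    centers = [rng.choices(candidates, weights=weights)[0]]
    while len(centers) < k:
        w2 = [w * nearest(c, centers)[1] for c, w in zip(candidates, weights)]
        if sum(w2) == 0:
            break  # fewer than k distinct candidates were sampled
        centers.append(rng.choices(candidates, weights=w2)[0])
    return centers
```

For example, `oversample_init(points, 3, 6, 5, random.Random(1))` makes five sampling passes instead of one pass per center. In a MapReduce setting each pass could in principle be split across workers, since no point's sampling decision depends on another point's decision within the same round.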
Keywords/Search Tags: Cloud Computing, MapReduce, clustering, k-means, pk-means++