
Research On Clustering Algorithm Based On Distributed Platform

Posted on: 2015-05-07  Degree: Master  Type: Thesis
Country: China  Candidate: Y P Li  Full Text: PDF
GTID: 2348330518470442  Subject: Computer software and theory
Abstract/Summary:
Clustering is an important research direction in data mining and plays an important role in extracting information efficiently from data in industry, commerce, and scientific research. With the rapid development of information technology, the amount of data generated in these fields has grown explosively, and traditional clustering algorithms struggle to cope with it. Implementing traditional clustering algorithms in parallel on a distributed platform has therefore become urgent: it compensates for the limited performance of single-machine algorithms and makes full use of the platform's computing power. Hadoop, a distributed computing platform built for large-scale data, is open source and easy to extend, which has made it increasingly popular, and a growing number of enterprises use it as their basic distributed computing platform for big-data challenges. This thesis therefore studies how to parallelize a traditional single-machine clustering algorithm on Hadoop.
CLARA is a clustering algorithm that applies statistical sampling theory to the selection of cluster medoids. It is efficient and can handle large-scale data, but it requires many iterations, which reduces its efficiency in some situations. To address this problem, this thesis proposes a parallel implementation of CLARA. First, an analysis of the theory and characteristics of traditional clustering algorithms, including CLARA, shows that the implementation of CLARA involves complex operations. To simplify the algorithm, a new approach based on statistical theory is proposed that uses an average-value approximation. The new algorithm is implemented and tested experimentally, and the results demonstrate the effectiveness of the modification. Second, based on the technical features of the MapReduce computing framework, a general approach to adapting an algorithm to MapReduce is introduced. The structure and execution process of CLARA are analyzed, the feasibility of a parallel implementation is assessed, and the concrete steps of the parallel algorithm are designed. The modified CLARA algorithm is likewise parallelized, and its implementation steps are designed. Finally, a Hadoop platform is set up to test all of the modified algorithms, and the experimental results are analyzed in detail. In the comparative analysis, the modified algorithms perform well, which indicates that the parallel algorithm has some innovative and practical significance.
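For orientation, the sampling idea behind CLARA can be sketched in a few lines of Python. This is a minimal illustration of CLARA as it is commonly described (cluster several random samples with a k-medoids step, then keep the medoid set that is cheapest on the full data). It is not the thesis's modified algorithm, whose average-value approximation is not specified in this abstract. The function names (`pam`, `map_sample`, `clara`), the simplified swap search, and the map/reduce split shown here are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points: Sequence[Point], medoids: Sequence[Point]) -> float:
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(sample: List[Point], k: int) -> List[Point]:
    """Simplified PAM (hypothetical helper): greedy medoid swaps on a small sample."""
    medoids = sample[:k]
    best = total_cost(sample, medoids)
    improved = True
    while improved:  # stops once no single swap lowers the cost
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(sample, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def map_sample(points: Sequence[Point], k: int, sample_size: int,
               seed: int) -> Tuple[float, List[Point]]:
    """'Map' step (assumed split): cluster one random sample, score it on all data."""
    rng = random.Random(seed)
    sample = rng.sample(list(points), min(sample_size, len(points)))
    medoids = pam(sample, k)
    return total_cost(points, medoids), medoids


def clara(points: Sequence[Point], k: int, n_samples: int = 5,
          sample_size: int = 40, seed: int = 0) -> Tuple[float, List[Point]]:
    """'Reduce' step (assumed split): keep the medoid set with the lowest full-data cost."""
    results = [map_sample(points, k, sample_size, seed + i)
               for i in range(n_samples)]
    return min(results, key=lambda r: r[0])
```

Calling `clara(points, k=2)` on a list of coordinate tuples returns a `(cost, medoids)` pair. Because each call to `map_sample` is independent of the others, the samples could in principle be processed by separate workers, which is the property a MapReduce-style parallelization would exploit.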
Keywords/Search Tags: clustering analysis, distributed computing, parallel clustering, Hadoop, CLARA