
The Clustering Algorithm Based On Hadoop Parallel Analysis And Applied Research

Posted on: 2013-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: A P Chen
GTID: 2248330374986013
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of computer technology and the spread of the Internet, the volume of data people encounter (including structured data and unstructured text) is growing explosively. How to mine valuable information effectively from massive data is therefore of great significance. Cluster analysis is one of the core technologies of data mining. In terms of both efficiency and computational complexity, traditional single-machine clustering algorithms can no longer meet the demands of processing massive data, and the development of cloud computing provides a new research direction for cluster analysis.

Hadoop, an open-source Apache project, is a distributed computing framework for building cloud platforms. It uses HDFS (its distributed file system) to store data and the MapReduce programming model to process massive data in parallel. Given the characteristics of traditional clustering algorithms and the structure of the MapReduce programming model, developers can parallelize clustering algorithms quickly and easily without paying much attention to the communication details of parallelization.

This thesis analyzes and compares a number of traditional clustering algorithms and makes improvements that address the randomness of choosing initial cluster centers and the tendency of clustering results to settle in a local optimum. It further studies how to apply these improvements, combined with the Hadoop framework, to related practical projects. The results show that the improvements significantly increase both the efficiency of the algorithms and the accuracy of the clustering results. The focus of this thesis is summarized as follows:

1) The MapReduce programming model is studied, and the advantages and disadvantages of the traditional K-means algorithm and the canopy algorithm are analyzed. A two-stage clustering idea based on the canopy algorithm (CTK) is proposed, the parallel framework of CTK on Hadoop is given, and its implementation is described in detail.

2) The maximum and minimum distance algorithm is analyzed, and K-means clustering based on the maximum and minimum distance algorithm (MMKMEANS) is proposed. Combined with the MapReduce programming framework, MMKMEANS is parallelized on Hadoop.

3) The whole clustering process for hot-spot generation is analyzed. The strategy of using the Nutch crawler to obtain web page information is studied, along with the parallel process that converts parsed web content into text vectors, which provide the experimental data for the above algorithms. A parallel implementation of clustering for hot-spot generation is given.

4) The experimental results verify the superiority of the above algorithms in text clustering in terms of clustering quality, precision, parallel speedup, and so on.
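The abstract does not spell out how MMKMEANS selects its starting centers, so the following is only a minimal single-machine sketch of the commonly described max-min distance idea, not the thesis's implementation. It assumes a fixed number of centers k, Euclidean distance, and the first point as the starting center; the function name max_min_seeds is hypothetical. Each new center is the point whose distance to its nearest already-chosen center is largest, which avoids the randomness of picking initial centers by chance.

```python
import math


def max_min_seeds(points, k):
    """Pick k initial centers with the max-min (farthest-first) rule.

    Assumptions (not taken from the thesis): Euclidean distance,
    a fixed k, and points[0] as the first center.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    centers = [points[0]]
    # min_dist[i] holds the distance from points[i] to its nearest center so far.
    min_dist = [math.dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point farthest from all chosen centers.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        # Refresh nearest-center distances against the new center only.
        for i, p in enumerate(points):
            d = math.dist(p, points[idx])
            if d < min_dist[i]:
                min_dist[i] = d

    return centers
```

The returned centers could then be passed to an ordinary K-means loop as its initial centers. In a MapReduce setting, the distance refresh would be the natural candidate for the map step, but that mapping is an assumption rather than something the abstract states.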
Keywords/Search Tags: K-means clustering, MMKMEANS, CTK, MapReduce