Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform

Posted on: 2014-10-22 | Degree: Master | Type: Thesis
Country: China | Candidate: Y N Zhang | Full Text: PDF
GTID: 2268330422460770 | Subject: Computer application technology
Abstract/Summary:
As the Internet has grown, it has generated massive amounts of data. Traditional data mining technology, constrained by computer performance and by its programming model, is markedly ineffective at processing data on this scale.
Data mining aims to automatically discover information hidden in large amounts of data. Because a single processor is limited in computing power and memory capacity when faced with massive, high-dimensional data, parallel processing on multiple processors was proposed as a solution. The most common approach divides the large data set into several subsets and distributes them to the nodes, each of which has a single processor. When every node has finished processing its subset, the per-node results are aggregated into the final result. Compared with a single processor, multiple processors can significantly improve the efficiency of data mining.
Methods for parallelizing data mining mainly include those based on MPI and PVM and those based on the CPU and GPU. The former are simple and easy to use, but they place higher requirements on how the data is organized. The latter place higher requirements on the hardware, which hinders large-scale use. Overall, these methods force users to concentrate on how to implement the parallel computation, leaving little attention for other aspects.
MapReduce is a programming model proposed by Google in 2004. It simplifies the development of parallel programs and broadens the range of applications of parallel computing. Google MapReduce is a commercial system; in 2008 the Apache Hadoop platform provided an implementation of the MapReduce programming model together with HDFS (Hadoop Distributed File System), which is similar to GFS (Google File System). In recent years, with the development and adoption of the Hadoop platform, data mining on large data sets has become more popular. This thesis therefore studies parallel clustering algorithms on the Hadoop platform.
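The divide, process, and aggregate pattern described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it does not use Hadoop, the function names (`split_into_subsets`, `map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and word counting stands in for whatever per-node computation a real job would perform.

```python
from collections import defaultdict
from itertools import chain


def split_into_subsets(records, n_parts):
    """Divide the data set into n_parts subsets, one per (simulated) node."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_phase(subset):
    """Runs independently on each node: emit a (word, 1) pair per word."""
    return [(word, 1) for line in subset for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the per-key values from all nodes into the final result."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, n_parts=3):
    subsets = split_into_subsets(records, n_parts)
    mapped = chain.from_iterable(map_phase(s) for s in subsets)
    return reduce_phase(shuffle(mapped))


# Illustrative input only; not data from the thesis.
docs = ["hadoop map reduce", "map reduce cluster", "cluster hadoop hadoop"]
counts = run_job(docs)
```

The result does not depend on how many subsets the data is split into, which is what lets a framework choose the partitioning freely.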
The main idea of cloud computing is to distribute computing tasks across a virtual resource pool made up of a large number of computers, so that applications can obtain computing power, storage space, and software services according to user needs. This thesis deploys a cloud computing platform, parallelizes a clustering algorithm using the MapReduce model, and optimizes data segmentation, task allocation, parallel processing, fault tolerance, and other details. Because there are many clustering algorithms, this thesis studies only the k-means clustering algorithm and combines the traditional k-means algorithm with the Canopy algorithm. The improved algorithm is applied on the Hadoop platform. Experiments show that the parallel Canopy algorithm using the MapReduce model greatly improves the speed of text clustering on the SogouC and FuDan data sets. The Canopy algorithm is therefore much better suited than the k-means algorithm to clustering large data sets.
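One common way to combine Canopy with k-means is to use the canopy centers as the initial k-means centroids, so that the number of clusters does not have to be chosen by hand. The sketch below assumes that combination; the abstract does not say which variant the thesis uses. It is also sequential rather than MapReduce-based, and the thresholds `t1` and `t2`, the Euclidean distance, and the sample points are illustrative assumptions.

```python
import math


def canopies(points, t1, t2):
    """Build canopies with loose threshold t1 and tight threshold t2 (t1 > t2).

    Returns a list of (center, members) pairs. Canopies may overlap.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # deterministic pick; often chosen at random
        members = [p for p in points if math.dist(p, center) <= t1]
        result.append((center, members))
        # Points within t2 are tightly bound and cannot start a new canopy.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return result


def kmeans(points, centers, max_iter=20):
    """Plain k-means starting from the given centers."""
    for _ in range(max_iter):
        # Assignment step (the natural "map" side in a MapReduce version).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step (the natural "reduce" side): mean of each cluster.
        new_centers = [
            tuple(sum(xs) / len(members) for xs in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Illustrative 2-D points; the thesis clusters text documents instead.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
seeds = [center for center, _ in canopies(points, t1=3.0, t2=1.5)]
final_centers, clusters = kmeans(points, seeds)
```

In a MapReduce setting, the assignment step would typically run in mappers over data splits and the centroid update in reducers, with one job per iteration; that mapping is a general convention and not a detail taken from the thesis.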
Keywords/Search Tags: Document Clustering, K-means, Canopy, Hadoop Platform, MapReduce Parallelization