Font Size: a A A

Implementation And Optimization Of Parallel Clustering Algorithm Based On MapReduce

Posted on:2018-05-10Degree:MasterType:Thesis
Country:ChinaCandidate:B W ZhaoFull Text:PDF
GTID:2348330518986555Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Clustering analysis is a commonly used method in data processing algorithms,it is often used in data mining,machine learning,pattern recognition,etc.There are many types of clustering analysis,such as partitional clustering,hierarchical clustering,density-based clustering,model-based clustering and grid-based clustering.Different clustering type has different clustering logic and its own optimization methods.A clustering algorithm named partitioning around medoids(or PAM for abbreviation),which belongs to partitional clustering,has become one of the most commonly used clustering algorithms since it was proposed,because PAM algorithm overcomes the problem of K-means: being sensitive to abnormal data.But traditional PAM algorithm works inefficiently when deals with big data set because of its high time complexity.Considering the expansion of data scale in modern society and performance constraint of single machine,in order to improve the clustering efficiency,implementing the parallelization of clustering algorithm in a distributed computing environment becomes a solution.MapReduce is a programming framework for handling largescale datasets,and it gains popularity for its simplicity,fault tolerance,flexibility and scalability since its birth.In recent years,more and more researchers apply MapReduce programming framework to obtain high performance for clustering algorithms.This paper focuses on the performance optimization of PAM and proposed two kinds of optimization methods based on MapReduce.Firstly,this paper proposed to combine swarm intelligence(SI for abbreviation)and PAM in order to lower the time complexity of PAM during the iteration of updating medoids.A sampling method named reservoir sampling takes the responsibility for updating medoids between every iteration rather than having a O(9)2)computation like traditional PAM algorithm.Meanwhile,this paper enhances the searching capabilities of PAM by taking advantage of ant colony optimization(ACO for abbreviation)algorithm.The solution space is constructed as ants visit data instances,ants will accumulate pheromone on the instances they have visited.The stronger pheromone an instance has,the more attractive it is to ants.Ants partition all instances into appropriate clusters based on pheromone and heuristic information.All the processing procedures described above are implemented by MapReduce programming framework and are conducted experiments in Hadoop cluster to verify the algorithm validity.The other optimization method is reducing the data scale that clustering algorithm needs to handle with by sampling.The MapReduce programming framework needs repeated times of restarting MapReduce jobs,reading data from file system or writing data into file system and data shuffling,these operations result in expensive cost of I/O and network communication which will have impacts on data processing efficiency.In this paper,we propose two clustering processing models based on MapReduce model,PAM algorithm and careful seeding to obtain high performance and maintain cluster validity of PAM in the same time.The performance evaluation and clustering validation experiments demonstrate that the methods we have proposed are efficient,robust and scalable.
Keywords/Search Tags:Partitioning Around Medoids, Ant Colony Optimization, Sampling, MapReduce, Big Data
PDF Full Text Request
Related items