Research on MapReduce-Based Big Data K-means Clustering Algorithm

Posted on: 2015-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: X L Cui
GTID: 2298330467986612
Subject: Computer application technology
Abstract/Summary:
Clustering is one of the most active research directions in data mining. It divides data objects into clusters so that similarity is high within a cluster and low between different clusters. With the rapid development of information technology and the growth of data volumes, the demands on the performance, reliability, and scalability of clustering have risen steadily, and clustering of big data has become particularly important. Among the many clustering algorithms, the partitional K-means algorithm is very popular because of its simplicity. This thesis studies the performance optimization of big data K-means clustering using MapReduce.

Because a single machine cannot meet the requirements of big data processing, a natural solution is parallelism in a distributed computing environment. Some researchers have used the MapReduce parallel programming framework for big data clustering and improved its performance. However, K-means clustering on MapReduce runs as a series of iterated jobs. In each job, the Mappers must read the original data from the Hadoop file system, and all of the data is shuffled across the cluster network to the appropriate Reducers for further processing. This leads to high I/O and network costs, a problem that has not been well solved.

To address this bottleneck, this thesis proposes a novel MapReduce processing model that uses uniform random sampling and iterated sampling to reduce the data size. The model performs the iterative computation inside the MapReduce job, which avoids repeated job start-up, repeated reads of the full data set, and multiple shuffles. This reduces I/O and network costs, achieves high performance, and reduces the effect of outliers in the data set on the clustering result. In addition, we apply two data merging strategies, WMC and DMC, to improve the accuracy of the clustering result. Extensive experiments on clusters show that the proposed methods are efficient and scalable, and that the multiple-sample processing strategy can reduce the effect of outliers in the data set.
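
To make the bottleneck described in the abstract concrete, the following is a minimal single-process sketch of the conventional iterated MapReduce K-means, not the sampling-based model the thesis proposes. The function names (`map_phase`, `reduce_phase`, `kmeans`) and the in-memory lists standing in for the distributed file system and the shuffle are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Mapper: emit (centroid index, point) for every input point."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_phase(pairs, centroids):
    """Reducer: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:  # grouping by key stands in for the network shuffle
        groups[idx].append(p)
    new_centroids = list(centroids)  # a centroid with no points is left unchanged
    for idx, members in groups.items():
        dim = len(members[0])
        new_centroids[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centroids


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Run one map/reduce pass per iteration until the centroids stop moving."""
    for _ in range(max_iter):
        # Every iteration re-reads all points and re-shuffles all of them.
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

In this sketch each pass through the loop in `kmeans` corresponds to one MapReduce job, so the full input is read and shuffled once per iteration. That per-iteration cost is what the abstract identifies as the source of high I/O and network overhead.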
Keywords/Search Tags: Big data, K-means, MapReduce, Efficiency, Scalable