Research On MapReduce-Based Big Data K-means Clustering Algorithm

Posted on:2015-05-15

Degree:Master

Type:Thesis

Country:China

Candidate:X L Cui

Full Text:PDF

GTID:2298330467986612

Subject:Computer application technology

Abstract/Summary:

Clustering is one of the most active research directions in data mining. It divides data objects into clusters so that objects within the same cluster are highly similar and objects in different clusters are dissimilar. With the rapid development of information technology and the growth in data volume, the demands on clustering performance, reliability and scalability have risen steadily, and the clustering of big data has become particularly important. Among the many clustering algorithms, the partitional K-means algorithm is widely used because of its simplicity. This thesis studies the performance optimization of big data K-means clustering using MapReduce.

A single machine cannot meet the requirements of big data processing, so a natural solution is parallelism in a distributed computing environment. Some researchers have used the MapReduce parallel programming framework for big data clustering and have improved performance. However, K-means clustering on MapReduce runs as a sequence of iterated jobs. In each job, the Mappers must read the original data from the Hadoop Distributed File System (HDFS), and all of the data is shuffled across the cluster network to the appropriate Reducers for further processing. This leads to high I/O and network costs, a problem that has not been well solved.

To address this bottleneck, this thesis proposes a novel MapReduce processing model that uses uniform random sampling and iterated sampling to reduce the data size. The model performs the iterative computation inside the MapReduce job and so avoids repeated job start-up, repeated reading of the full data set and multiple shuffles. This reduces I/O and network costs, achieves high performance and reduces the effect of outliers in the dataset on the clustering result. In addition, we apply two data merging strategies, WMC and DMC, to improve the accuracy of the clustering result. Extensive experiments on clusters show that the proposed methods are efficient and scalable, and that the multiple-sample processing strategy reduces the effect of outliers in the dataset.
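
The processing model outlined in the abstract (sample inside each mapper, iterate locally, merge in the reducer) can be illustrated with a minimal single-process sketch in Python. This is not the thesis's implementation: it simulates the map and reduce phases with ordinary function calls instead of Hadoop, the names `farthest_first`, `map_task` and `reduce_task` are hypothetical, and the deterministic farthest-first seeding is an assumption made to keep the example reproducible. The abstract does not define WMC or DMC, so the merge step shown here is a generic size-weighted merge of local centers, not either of those strategies.

```python
import random
from math import dist


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): spread k seeds apart."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, weights=None, iters=20):
    """Weighted Lloyd iterations; returns (centers, total weight per cluster)."""
    weights = weights or [1.0] * len(points)
    centers = farthest_first(points, k)
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest current center.
            j = min(range(k), key=lambda c: dist(p, centers[c]))
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Move each center to the weighted mean of its points (keep empty ones).
        new_centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, totals


def map_task(split, k, sample_size, seed):
    """One mapper: uniform random sample of its split, then all iterations locally."""
    rng = random.Random(seed)
    sample = rng.sample(split, min(sample_size, len(split)))
    return kmeans(sample, k)


def reduce_task(mapper_outputs, k):
    """One reducer: cluster the local centers, weighting each by its cluster size."""
    centers = [c for cs, _ in mapper_outputs for c in cs]
    weights = [w for _, ws in mapper_outputs for w in ws]
    merged, _ = kmeans(centers, k, weights=weights)
    return merged


# Toy data: two well-separated blobs, split across four simulated mappers.
rng = random.Random(1)
data = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]
data += [(rng.uniform(9, 11), rng.uniform(9, 11)) for _ in range(100)]
rng.shuffle(data)
splits = [data[i::4] for i in range(4)]
outputs = [map_task(s, k=2, sample_size=20, seed=i) for i, s in enumerate(splits)]
merged = sorted(reduce_task(outputs, k=2))
print(merged)
```

In this sketch each mapper reads its split once and emits only k centers with their cluster sizes, so the shuffle carries a handful of records instead of the full data set. That is the intuition behind the I/O and network savings the abstract claims; the actual savings and accuracy depend on the sampling and merging details given in the thesis itself.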

Keywords/Search Tags:

Big data, K-means, MapReduce, Efficiency, Scalable

Related items

1	MapReduce-enabled scalable nature-inspired approaches for clustering
2	Scalable parallel computing on clouds: Efficient and scalable architectures to perform pleasingly parallel, MapReduce and iterative data intensive computations on cloud environments
3	Research On Accelerating Of K-means Clustering Algorithm Using FPGA Based On MapReduce
4	A Study And Implementation Of Scalable Data Index Based On Mapreduce
5	Research On K-Means Algorithm Based On MapReduce
6	Research On Parallel Sampling K-Means Algorithm Based On MapReduce
7	Research On Extended Technology Of Scalable High Efficiency Video Coding
8	Using MapReduce for scalable and distributed processing of scientific XML data
9	Research On High Scalable Clustering Analysis Method
10	Research On Parallelization Of Clustering Algorithm Based On MapReduce