Research On Parallelization Of Clustering Algorithm Based On Heterogeneous Hadoop Platform

Posted on: 2015-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: W J Wei
Full Text: PDF
GTID: 2298330431492565
Subject: Computer software and theory
Abstract/Summary:
Cluster analysis is an important and widely studied research method in the field of data mining. Density-based clustering algorithms have attracted particular attention because they can effectively filter out noise and discover clusters of arbitrary shape. In the information age, people can obtain a wide variety of data from the network, and the amount of data stored in databases has risen sharply. Extracting valuable information and knowledge from such massive data has become very difficult, which has prompted research into large-scale parallel data processing. With the development of parallel computing, distributed computing, and grid computing, cloud computing has become a hot research topic. Hadoop is an open-source cloud computing platform designed mainly for processing huge amounts of data in parallel. It runs on clusters built from large numbers of inexpensive computers, so it can effectively reduce computational cost and improve data processing capability.

This thesis mainly studies how to cluster massive data on a heterogeneous Hadoop platform. First, we design a Proportional Data Placement strategy. The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous, and its default data placement therefore reduces MapReduce performance on heterogeneous clusters. The main idea is to compute each node's compute rate and split the data accordingly, which produces a number of data sets of unequal (skewed) size. Each node then selects and stores data blocks according to its own performance, so that the running time of each node is roughly the same and data transfer is reduced. Next, because MapReduce divides data into disjoint blocks and thereby cuts the links between the original data points, we propose a data division method based on intersection areas. Then, building on the heterogeneous Hadoop platform, we use the MapReduce programming model to parallelize the DBSCAN algorithm.
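
The proportional placement idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `proportional_allocation`, the node names, the rate values, and the largest-remainder rounding rule are all assumptions made for illustration. The sketch only shows how a fixed number of data blocks might be divided in proportion to each node's measured compute rate, so that blocks divided by rate (a rough proxy for running time) comes out about equal across nodes.

```python
def proportional_allocation(rates, total_blocks):
    """Split total_blocks across nodes in proportion to their compute rates.

    rates: dict mapping a (hypothetical) node name to a positive compute rate.
    Rounding uses the largest-remainder rule, an assumption of this sketch.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not rates or any(r <= 0 for r in rates.values()):
        raise ValueError("rates must be a non-empty dict of positive numbers")

    total_rate = sum(rates.values())
    # Ideal (fractional) share of blocks for each node.
    exact = {node: total_blocks * r / total_rate for node, r in rates.items()}
    # Start from the floor of each share.
    alloc = {node: int(share) for node, share in exact.items()}
    # Hand the remaining blocks to the nodes with the largest fractional parts.
    leftover = total_blocks - sum(alloc.values())
    by_remainder = sorted(rates, key=lambda n: exact[n] - alloc[n], reverse=True)
    for node in by_remainder[:leftover]:
        alloc[node] += 1
    return alloc


# Hypothetical cluster: one node is 4x and another 2x as fast as the slowest.
rates = {"fast": 4.0, "medium": 2.0, "slow": 1.0}
alloc = proportional_allocation(rates, 70)
print(alloc)  # {'fast': 40, 'medium': 20, 'slow': 10}
# Estimated relative running time (blocks / rate) is equal for every node.
print({node: alloc[node] / rates[node] for node in rates})
```

With a uniform split, the slowest node would hold as many blocks as the fastest and would determine the job's finishing time; the proportional split avoids that imbalance under the assumption that the measured rates are accurate.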
Finally, we construct a cloud environment from a heterogeneous Hadoop cluster to test the Proportional Data Placement strategy and the parallel DBSCAN algorithm separately. Experimental results show that our data placement strategy effectively improves MapReduce performance and rebalances the data across nodes. The parallel DBSCAN clustering algorithm greatly improves the processing efficiency for large data sets and has good scalability.
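
The abstract does not spell out how the intersection-area division works, so the following is only one plausible reading, sketched under stated assumptions: the data is cut into strips along the first coordinate, and each strip is widened by the DBSCAN radius `eps` on both sides, so that points near a boundary are copied into both neighboring partitions. The function name `partition_with_overlap` and the strip-based layout are hypothetical and are not taken from the thesis.

```python
def partition_with_overlap(points, num_parts, eps):
    """Split points into num_parts strips along x, each widened by eps.

    points: list of coordinate tuples; only the first coordinate is used
    for splitting. Points within eps of a strip boundary land in both
    neighboring partitions (the assumed "intersection area").
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    if eps < 0:
        raise ValueError("eps must be non-negative")
    parts = [[] for _ in range(num_parts)]
    if not points:
        return parts

    xs = [p[0] for p in points]
    lo, hi = min(xs), max(xs)
    # Fall back to a unit width when all points share the same x value.
    width = (hi - lo) / num_parts or 1.0

    for p in points:
        for i in range(num_parts):
            left = lo + i * width - eps
            right = lo + (i + 1) * width + eps
            if left <= p[0] <= right:
                parts[i].append(p)
    return parts


points = [(0.0, 0.0), (4.9, 1.0), (5.2, 1.0), (10.0, 0.0)]
for i, part in enumerate(partition_with_overlap(points, num_parts=2, eps=0.5)):
    print(i, part)
# 0 [(0.0, 0.0), (4.9, 1.0), (5.2, 1.0)]
# 1 [(4.9, 1.0), (5.2, 1.0), (10.0, 0.0)]
```

Because the two points near x = 5 appear in both partitions, a later merge step could use them to join clusters that were found separately on each side of the cut. How the thesis actually performs that merge is not stated in the abstract.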
Keywords/Search Tags: Heterogeneous Hadoop Platform, parallel clustering, DBSCAN algorithm, compute rate, Proportional data placement strategy