Research On Parallelization Of Clustering Algorithm Based On Heterogeneous Hadoop Platform

Posted on:2015-12-15

Degree:Master

Type:Thesis

Country:China

Candidate:W J Wei

Full Text:PDF

GTID:2298330431492565

Subject:Computer software and theory

Abstract/Summary:

Cluster analysis is an important and widely studied research method in the field of data mining. Density-based clustering algorithms have received particular attention because they can effectively filter out noise data and discover clusters of arbitrary shape. In the information age, people can obtain a wide variety of data from the network, which has sharply increased the amount of data held in databases. Extracting valuable information and knowledge from such massive data is very difficult, and this has prompted research into large-scale parallel data processing. With the development of parallel computing, distributed computing, and grid computing, cloud computing has become a hot research topic. Hadoop is an open-source cloud computing platform, mainly used for processing huge amounts of data in parallel. It runs on a cluster consisting of a large number of inexpensive computers, so it can effectively reduce computational cost and improve data processing capability.

This thesis mainly studies how to cluster massive data on a heterogeneous Hadoop platform. First, we design a Proportional Data Placement strategy. It addresses the fact that the current Hadoop implementation assumes the computing nodes in a cluster are homogeneous and uses a default data placement that reduces MapReduce performance. The main idea is to compute each node's compute rate and split the data accordingly, forming a number of skewed data sets. Each node is assigned and stores data blocks according to its own performance, so that the running time of each node is roughly the same and data transfer is reduced. Next, MapReduce divides data into disjoint data blocks, cutting off the links between the original data points. We therefore propose a data division method based on intersection areas. Then, on the heterogeneous Hadoop platform, we use the MapReduce programming model to parallelize the DBSCAN algorithm.
Finally, we construct a cloud environment on heterogeneous Hadoop to test the Proportional Data Placement strategy and the parallel DBSCAN algorithm respectively. Experimental results show that our data placement strategy effectively improves MapReduce performance and rebalances the data across nodes. The parallel DBSCAN clustering algorithm greatly improves the processing efficiency of large data sets and has good scalability.
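
The proportional placement idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name, the node names, the interpretation of compute rate as "blocks processed per unit time", and the largest-remainder rounding are all assumptions made for illustration.

```python
def proportional_block_counts(compute_rates, total_blocks):
    """Split total_blocks among nodes in proportion to their compute rates.

    compute_rates: dict mapping a (hypothetical) node name to its measured
    compute rate, assumed here to mean blocks processed per unit time.
    Returns a dict mapping each node name to an integer number of blocks.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    total_rate = sum(compute_rates.values())
    if total_rate <= 0:
        raise ValueError("compute rates must sum to a positive value")

    # Ideal fractional share of blocks for each node.
    ideal = {node: total_blocks * rate / total_rate
             for node, rate in compute_rates.items()}

    # Round down first, then hand the leftover blocks to the nodes with the
    # largest fractional remainders (an assumed rounding rule).
    counts = {node: int(share) for node, share in ideal.items()}
    leftover = total_blocks - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda n: ideal[n] - counts[n],
                          reverse=True)
    for node in by_remainder[:leftover]:
        counts[node] += 1
    return counts


# Hypothetical three-node heterogeneous cluster.
rates = {"fast": 4, "medium": 2, "slow": 1}
counts = proportional_block_counts(rates, 70)

# Estimated running time per node = assigned blocks / compute rate.
times = {node: counts[node] / rates[node] for node in rates}
```

With these assumed rates, the faster nodes receive more blocks and the estimated running times come out equal, which is the balancing effect the strategy aims for. A real deployment would also have to account for HDFS replication and data locality, which this sketch ignores.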

Keywords/Search Tags:

Heterogeneous Hadoop Platform, parallel clustering, DBSCAN algorithm, compute rate, Proportional data placement strategy

Related items

1	Research On Hadoop Based Data Placement Strategy
2	Research On Hadoop Based Iterative Data Processing And Data Placement Strategy
3	Research And Implementation Of Hadoop Load Balancing Strategy In Heterogeneous Environment
4	Research On Dynamic Management Of Data Replicas In Heterogeneous Hadoop Clusters
5	Research On Dynamic Management Of Data Replicas In Heterogeneous Hadoop Cluster
6	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
7	English On Design And Implementation Of Network Data Parallel Processing System Based On Hadoop Platform
8	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
9	Research On Mining Taxi Pick-up Hotspots Area Based On Big Data Hadoop Platform
10	Study On The Robust Optimization Of HADOOP Under The Restriction Of Cluster Computing Efficiency