Analysis And Research On Parallel Clustering Algorithm Based On Hadoop

Posted on:2016-09-06

Degree:Master

Type:Thesis

Country:China

Candidate:H Xu

GTID:2308330464968489

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the information age and the spread of mobile Internet devices, the amount of data associated with these devices, including social production data and scientific data, is growing explosively. Both academia and industry urgently need ways to extract knowledge from such massive data. This thesis studies the density-based clustering algorithm DBSCAN in depth and targets two drawbacks of the traditional algorithm: its high computational time complexity and the difficulty of processing Chinese address information. It makes three contributions. First, it designs a denoising and mapping process for Chinese address data. Second, it adapts the traditional DBSCAN algorithm to fit the MapReduce programming model. Third, to improve the efficiency of data partitioning, it proposes a new data partitioning algorithm, PRBP-DI (PRBP-Double Index), based on the existing data partitioning algorithm PRBP. Finally, the improved algorithm is run on a Hadoop 2.2 cloud computing platform.

The experimental results show the following. After denoising and mapping preprocessing, the original data retain only the ID and the latitude and longitude needed for the study. The extracted data, mapped into two-dimensional space, represent the same location as the original Chinese address, and reverse mapping recovers the original Chinese address, so the Chinese address preprocessing is effective. For the same amount of data, partitioning with the improved PRBP-DI algorithm takes one quarter to one third of the time required by PRBP, so PRBP-DI is the more efficient partitioning algorithm. Finally, when the improved DBSCAN algorithm processes the distributed data blocks in parallel on different nodes of the Hadoop 2.2 platform, it produces the same number of clusters as the traditional DBSCAN algorithm while taking less time. Together, the two improved algorithms and the data preprocessing process improve both the efficiency and the accuracy of processing massive Chinese address data.
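
For orientation, the traditional (sequential) DBSCAN baseline that the abstract refers to can be sketched as below. This is a minimal illustration, not the thesis's implementation: it assumes the points are already plain two-dimensional coordinates compared with Euclidean distance (real latitude and longitude data would likely need a geographic distance), and the names region_query, dbscan, eps and min_pts are chosen here for illustration. The brute-force neighbor search makes the whole procedure quadratic in the number of points, which is the kind of cost that motivates partitioning and parallel execution.

```python
from math import hypot

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels
```

Calling dbscan(points, eps, min_pts) on a list of (x, y) tuples returns one label per point, so the number of clusters is the number of distinct non-noise labels. That count is the quantity the abstract compares between the traditional and the parallel version.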

Keywords/Search Tags:

MapReduce, Density clustering DBSCAN, PRBP-DI, Cloud computing, Chinese address data

Related items

1	Cloud Computing And A Number Of Data Mining Algorithms Mapreduce Research
2	Research On DBSCAN Algorithm Based On Cloud Computing
3	A MapReduce Based Adaptive Density Clustering Algorithm
4	Performance Optimization And Applications Of MapReduce In Cloud Computing
5	The Research Of Task Scheduling Algorithm For Mapreduce Framework In Cloud Environment
6	Design And Implementation Of Visual Data Platform Based On MapReduce
7	The Application Of Improved DBSCAN On DBMAS
8	Research On Method Of Chinese Text Classification Based On Cloud Computing
9	Research On Verifiable Computation Based On MapReduce In Cloud Computing
10	Research On Clustering Algorithms Of Location Big Data Based On MapReduce