Analysis And Research On Parallel Clustering Algorithm Based On Hadoop

Posted on: 2016-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Xu
Full Text: PDF
GTID: 2308330464968489
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the information age and the spread of mobile Internet devices, the volume of data associated with these devices, including social production data and scientific data, is growing explosively. Academia and industry both urgently need ways to extract knowledge from massive amounts of data. This thesis studies the density-based clustering algorithm DBSCAN in depth and addresses two drawbacks: the high computational time complexity of traditional DBSCAN, and the difficulty of processing Chinese address information. The work has three parts. First, a denoising and mapping process is designed for Chinese address data. Second, the traditional DBSCAN algorithm is adapted to fit the MapReduce programming model. Third, to improve the efficiency of data partitioning, a new data partitioning algorithm, PRBP-DI (PRBP-Double Index), is proposed on the basis of the PRBP partitioning algorithm. Finally, the improved algorithm is run on a Hadoop 2.2 cloud computing platform. The experimental results show the following. After denoising and mapping preprocessing, the original data contains only the ID, latitude, and longitude needed for the study. The extracted data, mapped into a two-dimensional space, corresponds to the same locations as the Chinese addresses, and reverse mapping recovers the original Chinese address, so the Chinese address preprocessing is effective. For the same amount of data, the partitioning time of the improved PRBP-DI algorithm is one quarter to one third that of the PRBP algorithm, so PRBP-DI is the more efficient partitioning algorithm. On the Hadoop 2.2 platform, the improved DBSCAN algorithm processes the partitioned data blocks in parallel on different nodes; it produces the same number of clusters as the traditional DBSCAN algorithm while taking less time. Together, the two improved algorithms and the raw-data preprocessing procedure improve both the efficiency and the accuracy of processing massive Chinese address data.
Keywords/Search Tags: MapReduce, Density clustering DBSCAN, PRBP-DI, Cloud computing, Chinese address data
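
For readers unfamiliar with the baseline that the abstract improves on, the following is a minimal single-machine sketch of the classic DBSCAN procedure applied to two-dimensional points such as (longitude, latitude) pairs. It is not the thesis's MapReduce implementation and does not include PRBP or PRBP-DI partitioning. The names `region_query` and `dbscan`, the parameters `eps` and `min_pts`, and the use of plain Euclidean distance are all illustrative assumptions.

```python
from math import hypot

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    eps and min_pts are hypothetical parameter names for the neighborhood
    radius and the minimum neighborhood size of a core point.
    """
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

Each call to `region_query` scans every point, so this sketch performs a quadratic number of distance computations. That cost is the kind of overhead that motivates partitioning the data and clustering the blocks in parallel, as the abstract describes.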