The Research Of Data Optimization And Application Of Clustering Algorithm Based On Hadoop

Posted on:2016-03-03

Degree:Master

Type:Thesis

Country:China

Candidate:L Gu

Full Text:PDF

GTID:2308330470469713

Subject:Software engineering

Abstract/Summary:

Parallel clustering is one of the most active areas of data mining and is widely used. Cluster analysis groups data objects so that the most representative clusters can be identified. Fujian Province has a subtropical monsoon climate, a long coastline, and a high average temperature; windy weather systems occur frequently, and wind resources are abundant. This thesis uses meteorological data for 1961-1990 from 70 weather stations in Fujian Province. The temperature and wind speed data are first gridded. The parallel GK-means algorithm is then used to study temperature zoning in Fujian Province, and the parallel G-DBSCAN algorithm is used to study wind zoning.

This thesis studies data optimization for clustering algorithms based on Hadoop and its application. The research contents are as follows:

(1) The k-means algorithm selects its initial cluster centers at random, which can make the algorithm unstable, and its results are easily affected by outliers. To address these problems, the algorithm is improved with a grid method, giving the GK-means clustering algorithm. The intra-class distance and accuracy of GK-means are tested on the Iris dataset from the standard UCI repository. The experimental results show that GK-means improves on k-means in both stability and convergence.

(2) The main purpose of the DBSCAN algorithm is to find core objects, and the search for core objects occupies a large amount of memory. The grid method is again used to improve the algorithm, and the G-DBSCAN clustering algorithm is proposed. The memory footprint and time consumption of the algorithm are tested on the Iris, Wine, Glass and Indian datasets from the standard UCI repository. The experimental results show that G-DBSCAN consumes far less memory and time than DBSCAN.

(3) The GK-means and G-DBSCAN algorithms are not efficient on large data sets, and their processing time is too long. To solve this problem, parallel versions of both algorithms are designed on Hadoop so that large amounts of data can be handled more efficiently. The experimental results show that parallel GK-means and parallel G-DBSCAN achieve good speedup, and that the speedup improves as the data set grows.
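The abstract does not say how the grid method is applied to k-means, so the following is only a minimal sketch of one plausible reading: partition the data space into equal-width cells, seed the cluster centers from the centroids of the densest cells, and leave sparse cells (which tend to hold outliers) out of the seeding. The function names `grid_initial_centers` and `kmeans` and the parameter `cells_per_dim` are hypothetical and do not come from the thesis. The assignment and update steps are labeled "map" and "reduce" only to suggest how an iteration could be split across a MapReduce job; the code itself runs on a single machine.

```python
from collections import defaultdict
import math


def grid_initial_centers(points, k, cells_per_dim=4):
    """Return the centroids of the k densest grid cells as initial centers.

    Assumption: this is an illustrative grid-seeding scheme, not the
    thesis's GK-means. If fewer than k cells are occupied, fewer than
    k centers are returned.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # Cell width per dimension; fall back to 1.0 when a dimension is constant.
    widths = [(highs[d] - lows[d]) / cells_per_dim or 1.0 for d in range(dims)]

    cells = defaultdict(list)
    for p in points:
        key = tuple(
            min(int((p[d] - lows[d]) / widths[d]), cells_per_dim - 1)
            for d in range(dims)
        )
        cells[key].append(p)

    # Densest cells first, so sparse cells are the last to be used as seeds.
    densest = sorted(cells.values(), key=len, reverse=True)[:k]
    return [
        tuple(sum(p[d] for p in cell) / len(cell) for d in range(dims))
        for cell in densest
    ]


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    groups = defaultdict(list)
    for _ in range(max_iter):
        # "Map" step: assign every point to its nearest center.
        groups = defaultdict(list)
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[idx].append(p)
        # "Reduce" step: recompute each center as the mean of its points.
        new_centers = [
            tuple(sum(c) / len(groups[i]) for c in zip(*groups[i]))
            if groups[i] else centers[i]
            for i in range(len(centers))
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Two tight groups plus one isolated point that should not become a seed.
sample = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
          (10.0, 0.0)]
seeds = grid_initial_centers(sample, k=2)
final_centers, clusters = kmeans(sample, seeds)
```

Because the seeds are derived from cell densities rather than drawn at random, repeated runs on the same data start from the same centers, which is the kind of stability the abstract attributes to GK-means. The isolated point still gets assigned to a cluster here; the sketch only keeps it from being chosen as an initial center.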

Keywords/Search Tags:

cluster analysis, Grid, K-means, DBSCAN, MapReduce

Related items

1	The Application Of Improved DBSCAN On DBMAS
2	Research On Clustering Algorithms Of Location Big Data Based On MapReduce
3	KDSG-DBSCAN:A High Performance DBSCAN Algorithm Based On K-D Tree And Spark GraphX
4	Research On Subspace Cluster Algorithms On Similarity And DBSCAN
5	Application And Research Of DBSCAN Based On Hadoop Platform
6	Research Of Evidence Fusion Method Based On DBSCAN Clustering
7	Research On DBSCAN Algorithm Based On Grid And Density-ratio
8	Based On A Grid Of DBSCAN And Cluster Boundary Technology
9	The Design And Implementation Of A MapReduce Computing Framework Based On GPU Cluster
10	Research On Low Power Scheduling Technology For Heterogeneous Cluster Based On MapReduce