The Research Of Data Optimization And Application Of Clustering Algorithm Based On Hadoop

Posted on: 2016-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: L Gu
Full Text: PDF
GTID: 2308330470469713
Subject: Software engineering
Abstract/Summary:
Parallel clustering is one of the most active areas of data mining and is widely used in practice. Cluster analysis is a technique for finding the most representative clusters in a data set. Fujian Province has a subtropical monsoon climate, a long coastline, and a high average temperature; windy weather systems occur frequently, and wind resources are abundant. In this paper, we use meteorological data from 1961 to 1990 recorded at 70 weather stations in Fujian Province. After gridding the temperature and wind speed data, we use the parallel GK-means algorithm to study temperature zoning in Fujian Province and the parallel G-DBSCAN algorithm to study wind zoning in Fujian Province.

This paper studies data optimization for clustering algorithms based on Hadoop and its application. The research contents are as follows:

(1) The k-means algorithm selects its initial cluster centers at random, which can make the algorithm unstable; its results are also easily affected by outliers. To address these problems, we use a grid method to improve the algorithm, yielding GK-means. We test the intra-class distance and accuracy of the GK-means clustering algorithm on the standard UCI Iris data set. The experimental results show that GK-means improves on K-means in both stability and convergence.

(2) The main task of the DBSCAN algorithm is to find core objects, and the search for core objects occupies a large amount of memory. We again use the grid method to improve the algorithm and propose the G-DBSCAN clustering algorithm. We test the memory footprint and time consumption of the algorithm on the standard UCI Iris, Wine, Glass and Indian data sets. The experimental results show that G-DBSCAN consumes far less memory and time than DBSCAN.

(3) The GK-means and G-DBSCAN algorithms are not efficient on large data sets, and their processing time is too long. To solve this problem, we design parallel versions of both algorithms on Hadoop in order to handle large amounts of data more efficiently. The experimental results show that parallel GK-means and parallel G-DBSCAN achieve good speedup, and that the speedup improves as the data set grows.
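
The abstract does not say how the grid method chooses initial centers or how the Hadoop jobs are structured, so the following Python sketch is only one plausible reading, not the thesis's implementation. It assumes that grid seeding means taking the centroids of the k densest grid cells, and it writes one k-means iteration as a map step (assign each point to its nearest center) followed by a reduce step (average the points per center), which is the usual shape of k-means on MapReduce. All names (grid_seed_centers, map_assign, reduce_mean, kmeans_iteration, cells_per_axis) are hypothetical.

```python
from collections import defaultdict
from math import dist, floor


def grid_seed_centers(points, k, cells_per_axis=4):
    """Return centroids of the k densest grid cells as initial centers.

    Assumption: this is one plausible interpretation of "grid method";
    the abstract does not specify the actual GK-means seeding rule.
    Fewer than k centers are returned if fewer than k cells are occupied.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell_of(p):
        idx = []
        for d in range(dims):
            span = (highs[d] - lows[d]) or 1.0  # avoid division by zero
            i = floor((p[d] - lows[d]) / span * cells_per_axis)
            idx.append(min(i, cells_per_axis - 1))  # clamp the upper edge
        return tuple(idx)

    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)

    # Dense cells are unlikely to be dominated by isolated outliers.
    densest = sorted(cells.values(), key=len, reverse=True)[:k]
    return [tuple(sum(c) / len(m) for c in zip(*m)) for m in densest]


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def reduce_mean(assigned_points):
    """Reduce step: average all points that share one center index."""
    n = len(assigned_points)
    return tuple(sum(c) / n for c in zip(*assigned_points))


def kmeans_iteration(points, centers):
    """One k-means iteration expressed as map, shuffle, reduce."""
    groups = defaultdict(list)
    for key, value in (map_assign(p, centers) for p in points):
        groups[key].append(value)  # in-memory stand-in for the shuffle
    # A center that received no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]
```

Calling grid_seed_centers once and then repeating kmeans_iteration until the centers stop moving gives a single-machine analogue of the workflow described above; on a cluster, the map and reduce functions would run as distributed tasks over partitions of the data.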
Keywords/Search Tags: cluster analysis, Grid, K-means, DBSCAN, MapReduce