
Improved K-means Clustering Algorithm Based On MapReduce Framework

Posted on: 2020-01-09  Degree: Master  Type: Thesis
Country: China  Candidate: Y Song  Full Text: PDF
GTID: 2428330572973308  Subject: Applied Mathematics
Abstract/Summary:
Clustering is a popular research direction in data mining. It is the process of dividing a data set into clusters so that samples in the same cluster are highly similar and samples in different clusters are dissimilar. With the rapid development of information technology and the production of massive data, requirements on the efficiency, reliability, and scalability of clustering algorithms have risen steadily, making the clustering of massive data particularly important. Among the many clustering algorithms, the partition-based K-means algorithm has remained popular because of its simple principle and ease of use. Continued research on the algorithm, however, has gradually exposed its disadvantages alongside its advantages.

This thesis studies how to optimize K-means clustering performance on massive data. With the aim of improving the accuracy and efficiency of clustering, it examines existing partition-based clustering algorithms and addresses two problems of K-means: the selection of the initial cluster centers, and the sensitivity of the dissimilarity function to outliers and noise. On this basis, an improved K-means clustering algorithm under the MapReduce framework is proposed.

First, the dissimilarity function in K-means is computed from the Euclidean distance, which is sensitive to outliers and noise. In particular, as the data volume grows and the attribute types become more complex, the dissimilarity between data samples can no longer be calculated accurately. The thesis uses the Chebyshev distance to weight the Euclidean distance internally, that is, it applies a normalization idea to eliminate the sensitivity of the Euclidean distance to noise points and outliers, so that each data object can be assigned more reliably to its own cluster. A new dissimilarity formula is then given.

Second, the MapReduce programming model is improved, and the K-means algorithm is deployed on the improved model for parallel execution, which speeds up the processing of massive data while preserving clustering quality. To verify the effectiveness of the improved algorithm, a simulation experiment was carried out on the UCI data set, with a comparative analysis against an existing improved K-means algorithm. The experimental results showed that the improved algorithm increased both the accuracy and the convergence speed of clustering.

Finally, the improved clustering algorithm is applied to Uber and diabetes data sets. Cluster analysis of the Uber taxi data helps taxi drivers grasp urban demand and gives users a faster way to travel. For the diabetes data, clustering was used to analyze patient indicators and predict the risk of diabetes, indicating that the algorithm has good application prospects in medical data analysis.
Keywords/Search Tags:Clustering, K-means algorithm, Dissimilarity function, MapReduce model
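
As a rough illustration of how a K-means iteration splits into map and reduce steps, here is a minimal single-process sketch in Python. It is not the thesis's implementation: the function names (map_phase, reduce_phase, kmeans) are hypothetical, the MapReduce framework is only imitated with plain Python generators, and the dissimilarity defaults to the ordinary Euclidean distance because the abstract does not state the Chebyshev-weighted formula. A different dissimilarity could be passed in through the dissimilarity parameter.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Plain Euclidean distance (stand-in; the thesis's formula is not given here)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_phase(points, centers, dissimilarity=euclidean):
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: dissimilarity(p, centers[i]))
        yield idx, p


def reduce_phase(pairs, centers):
    """Reduce step: average the points that share a key to get new centers."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # an empty cluster keeps its old center
    for idx, members in groups.items():
        dim = len(members[0])
        new_centers[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centers


def kmeans(points, centers, dissimilarity=euclidean, max_iter=100, tol=1e-9):
    """Repeat map + reduce until the centers stop moving (or max_iter is hit)."""
    for _ in range(max_iter):
        pairs = map_phase(points, centers, dissimilarity)
        new_centers = reduce_phase(pairs, centers)
        shift = max(euclidean(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In a real MapReduce deployment the map step would run in parallel over data partitions and the reduce step would receive the pairs grouped by key from the framework; the sketch keeps both in one process only to show the division of work.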