K-Means Algorithm Design And Implementation Based On Hadoop And Mahout

Posted on: 2017-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Wang
GTID: 2308330482979894
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of computer technology, the amount of data on the Internet keeps growing, and extracting valuable information from it has become increasingly important. A single machine has limited computing power, storage space, and memory, which makes it ill-suited to processing data at this scale. Migrating a traditional algorithm to the Hadoop platform through parallelization can effectively address this problem.

This thesis uses a platform built on Hadoop and Mahout, both open source Apache projects. Hadoop is a distributed system framework, and Mahout is a library of data mining algorithms for cloud platforms. Combined with Mahout, Hadoop's computing power can be applied to the analysis of vast amounts of data. The thesis studies the K-means clustering algorithm and uses the Canopy algorithm for data preprocessing. Building on previous studies of the Canopy-K-means algorithm, it proposes a weighted Euclidean distance method based on the analytic hierarchy process (AHP). The improved algorithm is then used to cluster the standard KDD99 data set from UCI, and the results are analyzed. The main work is as follows:

(1) Introduce the related technologies, in particular Hadoop and Mahout.
(2) Analyze the advantages and disadvantages of the K-means algorithm. The Canopy algorithm is applied as a preprocessing step before K-means clustering, which reduces the influence of noise points and helps determine the value of K and the initial cluster centers. In addition, a weighted Euclidean distance method based on AHP is proposed, which is better suited to multidimensional data sets.
(3) Analyze and improve the K-means implementation in Mahout, cluster the standard data set, and analyze the results. The improved algorithm shows higher stability and accuracy and can handle huge amounts of data.
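The pipeline described in the abstract (AHP-derived feature weights, a Canopy pass to pick K and the initial centers, then K-means under a weighted Euclidean distance) can be illustrated with a minimal single-machine sketch. This is not the thesis's implementation and not Mahout's MapReduce code. The function names, the thresholds `t1` and `t2`, the pairwise comparison matrix, and the sample points are all hypothetical. The abstract does not say how the AHP weights are derived, so the row geometric-mean approximation is assumed here.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method (assumed)."""
    n = len(pairwise)
    geo = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo)
    return [g / total for g in geo]

def weighted_distance(a, b, w):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (a_i - b_i)^2)."""
    return math.sqrt(sum(wi * (x - y) ** 2 for x, y, wi in zip(a, b, w)))

def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def canopy_centers(points, t1, t2, w):
    """Canopy pass: one initial center per canopy, so K = number of canopies."""
    assert 0 < t2 < t1, "loose threshold t1 must exceed tight threshold t2"
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Points within the loose threshold t1 form this canopy.
        members = [p for p in remaining if weighted_distance(seed, p, w) < t1]
        centers.append(mean_point(members))
        # Points within the tight threshold t2 can no longer seed a canopy.
        remaining = [p for p in remaining if weighted_distance(seed, p, w) >= t2]
    return centers

def kmeans(points, centers, w, max_iter=100):
    """Lloyd-style K-means using the weighted distance for assignment."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)),
                      key=lambda i: weighted_distance(p, centers[i], w))
            clusters[idx].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean_point(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical input: feature 0 is judged three times as important as feature 1.
weights = ahp_weights([[1.0, 3.0], [1.0 / 3.0, 1.0]])
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
seeds = canopy_centers(points, t1=3.0, t2=1.5, w=weights)
centers, clusters = kmeans(points, seeds, weights)
print(len(seeds), "canopies;", [len(c) for c in clusters], "points per cluster")
```

Because the Canopy pass fixes both K and the starting centers, K-means no longer depends on random initialization. In a Hadoop setting the assignment loop would typically run in the map phase and the center update in the reduce phase, which this sketch does not attempt to show.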
Keywords/Search Tags: K-Means, Cloud computing, Data mining, Hadoop, Mahout