Research on Clustering Algorithms Based on Mahout

Posted on: 2015-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: H Yu
Full Text: PDF
GTID: 2298330431466874
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the amount of data generated in the field of web data mining is expanding rapidly, and traditional clustering algorithms can no longer meet the requirements of massive data processing. This is especially true of document clustering: traditional clustering algorithms can handle artificial data sets, and they assume that document data is small enough for single-machine processing. However, this assumption is not realistic, because document data sets are often huge and full of noise. Cloud computing, a new platform focused on big data and distributed parallel processing, has developed rapidly in recent years and achieved initial success in business, which has also attracted the attention of academics. In the cloud computing era, traditional clustering algorithms can be redesigned and implemented on a cloud computing platform to reduce their time and space complexity and to resolve efficiently the bottlenecks in large-scale data storage and computation.
Apache Hadoop is an open source cloud computing project that allows distributed processing of large data sets on large-scale clusters through the simple MapReduce computation model. Rather than depending on expensive hardware as in the past, it uses distributed storage and parallel computing across inexpensive nodes to obtain high availability. In addition, Hadoop is able to detect and handle node failures, and it continues to provide highly available services when individual nodes fail. The MapReduce computation model relies on the underlying HDFS (Hadoop Distributed File System), which supports local storage and computation on the cluster nodes.
Apache Mahout is an open source library of algorithms for data mining and machine learning, and these algorithms are based on the MapReduce programming model and HDFS. This paper discusses the parallel, distributed design and implementation of clustering algorithms based on Mahout, taking a typical clustering algorithm as an example. At the same time, the related algorithm is improved, and some methods and techniques for designing parallel algorithms are summarized. Mahout can be expected to be an excellent platform for big data processing, but its performance has not been fully tested. Therefore, in addition to discussing the parallel design and improvement of clustering algorithms, this paper tests the performance and effectiveness of several clustering algorithms on the platform through experiments, and gives a preliminary discussion of whether Mahout/Hadoop is a good platform for big data processing.
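To illustrate how a clustering algorithm can be expressed in the MapReduce model, the following is a minimal single-process sketch in Python of one k-means iteration split into a map step and a reduce step. K-means is used here only as an assumed example of a "typical clustering algorithm"; the abstract does not name the algorithm studied. The function names (`map_point`, `reduce_cluster`, `kmeans_iteration`) are hypothetical and are not part of the Mahout or Hadoop APIs.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)
    for point in points:
        # Map each point, then "shuffle" it into the group for its key.
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # Assumption of this sketch: a centroid with no assigned points is kept as-is.
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

On a real cluster the map calls would typically run in parallel on the nodes that hold each data split, and the per-cluster averaging would run in the reducers; the sketch shows only the data flow of a single iteration.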
Keywords/Search Tags:cloud computing, big data, clustering algorithms, Hadoop, HDFS, MapReduce, Mahout