Analysis and Research of Data Mining Algorithms Based on Hadoop

Posted on: 2013-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: M H Zhang
GTID: 2218330374965354
Subject: Computer application technology
Abstract/Summary:
Data mining, also called knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from large, incomplete, noisy, vague, and random data. Data mining technology is now widely used for decision-making and analysis in many fields, such as finance, medicine, the military, and management. With the rapid development of computer and Internet technology, the amount of data has grown explosively, which greatly increases the burden on data mining technology. The emergence of cloud computing offers a new approach to data mining: its advantages, namely flexible computing power, massive storage capacity, cost savings, and improved efficiency, make it an effective way to solve the problems that data mining technology faces. Hadoop is an open source Apache project for building cloud computing platforms; distributed computing platforms based on it have proven stable and are widely used in many areas. The Hadoop platform uses the MapReduce programming model for distributed computing and HDFS for file storage. Once traditional data mining algorithms are ported to the Hadoop platform, large-scale data mining tasks can be executed. Mahout is a new open source project under Apache that provides a number of machine learning and data mining algorithms implemented with the MapReduce programming model. Its goal is to help developers create intelligent applications more quickly and easily. This thesis first describes the Hadoop platform, with its MapReduce programming model and HDFS, and analyzes Hadoop's core architecture and operating mechanisms. It then discusses Mahout and studies Mahout's data representation. Taking the K-Means algorithm as an example, it analyzes the parallelization strategy used in Mahout. Finally, experiments on the Reuters-21578 dataset verify the validity of the algorithm; the experimental results are analyzed and the K-Means algorithm is evaluated. K-Means is also run on different amounts of data in serial and parallel modes, and the efficiency of the two modes is compared.
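
The abstract does not spell out the parallelization strategy itself. As a rough illustration only, one common way to express a K-Means iteration in MapReduce terms is sketched below in plain Python: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster into a new centroid. This is not Mahout's code and does not use Hadoop; the function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) and the parameters `max_iter` and `tol` are hypothetical and chosen for this sketch.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def map_phase(points, centroids):
    """Map: emit (cluster index, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)

def reduce_phase(pairs, centroids):
    """Reduce: sum the points of each cluster and average them into new centroids."""
    sums = {}
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        if idx not in sums:
            sums[idx] = list(point)
        else:
            sums[idx] = [s + p for s, p in zip(sums[idx], point)]
        counts[idx] += n
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else tuple(centroids[i])
        for i in range(len(centroids))
    ]

def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map and reduce until the centroids stop moving (hypothetical driver)."""
    for _ in range(max_iter):
        new = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(sum((a - b) ** 2 for a, b in zip(old, cur))
                    for old, cur in zip(centroids, new))
        centroids = new
        if shift <= tol:
            break
    return centroids
```

In a real Hadoop job the map and reduce steps would run on separate nodes over HDFS splits, and each iteration would be a separate job; here they are ordinary functions called in sequence, which corresponds to the serial mode.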
Keywords/Search Tags: Data Mining, Hadoop, Mahout, K-Means