Based On Hadoop Data Mining Algorithm Analysis And Research

Posted on:2013-02-01

Degree:Master

Type:Thesis

Country:China

Candidate:M H Zhang

Full Text:PDF

GTID:2218330374965354

Subject:Computer application technology

Abstract/Summary:

Data mining, also known as knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from large, incomplete, noisy, vague, and random data. Data mining technology is now widely used for decision-making and analysis in fields such as finance, medicine, the military, and management. With the rapid development of computer and Internet technology, the amount of data has grown explosively, greatly increasing the burden on data mining technology. The emergence of cloud computing offers a new approach to data mining: its advantages, namely flexible computing power, massive storage capacity, cost savings, and improved efficiency, make it an effective way to address the problems that data mining faces. Hadoop is an open source Apache project for building cloud computing platforms; distributed computing platforms based on it have proven very stable and are widely used in many areas. The Hadoop platform uses the MapReduce programming model for distributed computing and HDFS for file storage. Once traditional data mining algorithms are ported to the Hadoop platform, large-scale data mining tasks can be executed on it. Mahout is a newer open source project under Apache that provides a number of machine learning and data mining algorithms implemented with the MapReduce programming model. Its goal is to help developers create intelligent applications more quickly and easily. This paper first describes the Hadoop platform, including the MapReduce programming model and HDFS, and analyzes Hadoop's core architecture and operating mechanisms. It then discusses Mahout and studies Mahout's data representation. Taking the K-Means algorithm as an example, it analyzes the parallelization strategy used in Mahout. Finally, experiments on the Reuters-21578 dataset verify the validity of the algorithm, and the experimental results are analyzed to evaluate K-Means. K-Means is also run on different amounts of data in serial and parallel modes, and the efficiency of the two modes is compared.
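
The abstract does not spell out the parallelization strategy, so the following is only a minimal sketch of how one K-Means iteration is commonly expressed in the MapReduce style: a map step assigns each point to its nearest centroid, and a combine/reduce step sums the points per cluster and averages them. It is not Mahout's code. It runs in a single Python process, and every name in it (`map_point`, `combine`, `kmeans_iteration`, `kmeans`) is hypothetical, as are the Euclidean distance measure and the convergence threshold.

```python
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def combine(a, b):
    """Combine/reduce step: add two partial (vector sum, count) pairs."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans_iteration(points, centroids):
    """One K-Means iteration written as map, group by key, reduce."""
    partials = {}  # cluster index -> (vector sum, count)
    for point in points:
        key, value = map_point(point, centroids)
        partials[key] = combine(partials[key], value) if key in partials else value
    new_centroids = list(centroids)  # a cluster with no points keeps its old centroid
    for key, (vec_sum, count) in partials.items():
        new_centroids[key] = tuple(x / count for x in vec_sum)
    return new_centroids


def kmeans(points, centroids, max_iter=10, tol=1e-6):
    """Repeat iterations until no centroid moves more than tol (assumed threshold)."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:
            break
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(points, [(0, 0), (10, 10)]))
```

Because `combine` only adds partial sums and counts, it can be applied in any order and on any subset of points, which is the property that lets the per-cluster work be split across machines; each iteration in a real Hadoop job would be a separate MapReduce pass over the data.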

Keywords/Search Tags:

Data Mining, Hadoop, Mahout, K-Means

Related items

1	Based On Hadoop Data Mining Algorithm Analysis And Research
2	Design And Implementation Of The Data Mining Platform Based On Mahout
3	A Research And Implementation Of Recommender System Based On Mahout And Hadoop
4	The Optimization Of Parallelized K-means Based On Mahout
5	Research And Implementation Of Big Data Analysis And Mining Technology Based On Hadoop In Telecommunications Industry
6	Oneof Text Clustering Algorithm Based On Big Data
7	Research On Algorithm Of Data Mining Based On Hadoop
8	The Study And Implementation Of Recommendation Technology Based On Hadoop And Mahout
9	Design And Implementation Of Data Mining Algorithm Under Big Data Platform
10	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine