
Data Mining Based On Hadoop Platform

Posted on: 2014-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: W W Li
Full Text: PDF
GTID: 2268330401973693
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of science and technology, Internet user data have grown explosively, and traditional computer architectures struggle to process such big data. Cloud computing was proposed to handle complex, large-scale data. Hadoop, an open-source project of the Apache Software Foundation, delivers remarkable computing power, storage capacity, and processing ability on clusters of inexpensive machines. Data mining technology has likewise entered a stage of rapid development.

Building on the Hadoop platform and a thorough study of the mechanisms of distributed applications, this thesis examines the main ideas of several typical data mining algorithms, proposes schemes for migrating them to a distributed platform, and implements Hadoop versions of them, which can help data mining researchers carry out their work on Hadoop more effectively. The main research contents of this thesis are as follows:

(1) Data mining comprises three main kinds of algorithms: classification, clustering, and association rule mining. One typical algorithm was chosen from each kind. Working on data stored as text documents and using the MapReduce distributed programming framework, the operating principle of each algorithm was analyzed and a reconstruction scheme was designed; distributed versions of the Naive Bayes classification algorithm, the K-modes clustering algorithm, and the ECLAT frequent itemset mining algorithm were implemented. All of them run efficiently and stably on the Hadoop platform. A sketch of this reconstruction idea is given after the abstract.

(2) For unstructured data from the Internet, the HiveQL language is used as the retrieval entry point and the distributed GAC-RDB classification algorithm is implemented on top of the HBase distributed database. Using a high-level query language as the entry point requires no background knowledge of Java or MapReduce; it frees developers from tedious coding so that they can concentrate on the specific business analysis and complete similar work more quickly and effectively.

On the high-performance computing cluster of Northwest A&F University, multiple sets of experiments were designed to validate the effectiveness of the reconstructed algorithms and the efficiency of the Hadoop platform; curves were drawn from the experimental data and analyzed comprehensively. The results show that, while preserving the effectiveness and accuracy of the algorithms, the MapReduce programming framework improves algorithm efficiency and reduces data processing time, and the HiveQL query language shortens the development cycle and makes it more convenient to handle the various kinds of data stored in the distributed database.
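To make the reconstruction idea in (1) concrete, the sketch below shows how the counting phase of a MapReduce-based Naive Bayes trainer might look in Java on Hadoop: the mapper emits class counts (for the prior) and per-class attribute-value counts (for the conditionals), and the reducer sums them. This is not the thesis's actual code; the input format (comma-separated text records with the class label in the last field), the class names, and the key layout are illustrative assumptions.

// Minimal sketch (assumption: comma-separated records, class label in the last field);
// only the counting phase of a MapReduce Naive Bayes trainer is shown.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveBayesCountJob {

    // Map: one record -> one key for the class and one key per (class, attribute=value).
    public static class CountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length < 2) {
                return; // skip malformed records
            }
            String label = fields[fields.length - 1].trim();

            // Count the class itself (used for the prior P(c)).
            outKey.set("CLASS\t" + label);
            context.write(outKey, ONE);

            // Count each attribute value together with the class
            // (used for the conditional P(attribute_i = v | c)).
            for (int i = 0; i < fields.length - 1; i++) {
                outKey.set(label + "\t" + i + "=" + fields[i].trim());
                context.write(outKey, ONE);
            }
        }
    }

    // Reduce: sum the partial counts for every key.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "naive-bayes-counts");
        job.setJarByClass(NaiveBayesCountJob.class);
        job.setMapperClass(CountMapper.class);
        job.setCombinerClass(SumReducer.class); // local aggregation to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input: training records
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output: count tables
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A second pass (or a local post-processing step) would normalize these counts into prior and conditional probabilities; the K-modes and ECLAT reconstructions follow the same pattern of per-record map-side emission and reduce-side aggregation.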
Keywords/Search Tags: cloud computing, data mining, Hadoop, MapReduce, HBase