
Data Mining Based On Hadoop Platform

Posted on: 2014-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: W W Li
Full Text: PDF
GTID: 2268330401973693
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of science and technology, Internet user data have grown explosively, and traditional computer architectures struggle to process such big data. Cloud computing was proposed to handle complex, large-scale data. Hadoop, an open-source project of the Apache Software Foundation, delivers remarkable computing power, storage capacity, and processing ability on clusters of inexpensive machines. Data mining technology has likewise entered a stage of rapid development.

Building on the Hadoop platform and a thorough study of the mechanisms of distributed applications, this thesis examines the main ideas of several typical data mining algorithms, proposes schemes for migrating them to a distributed platform, and implements Hadoop versions of them, which can help data mining researchers carry out their work on Hadoop more effectively. The main research contents of this thesis are as follows:

(1) Data mining comprises three main kinds of algorithms: classification, clustering, and association rule mining. One typical algorithm was chosen from each kind. Working on data stored as text documents and using the MapReduce distributed programming framework, the operating principle of each algorithm was analyzed and a reconstruction scheme was designed; distributed versions of the Naive Bayes classification algorithm, the K-modes clustering algorithm, and the ECLAT frequent itemset mining algorithm were implemented. All of them run efficiently and stably on the Hadoop platform. A sketch of this reconstruction idea is given after the abstract.

(2) For unstructured data from the Internet, the HiveQL language is used as the retrieval entry point and the distributed GAC-RDB classification algorithm is implemented on top of the HBase distributed database. Using a high-level query language as the entry point requires no background knowledge of Java or MapReduce; it frees developers from tedious coding so that they can concentrate on the specific business analysis and complete similar work more quickly and effectively.

On the high-performance computing cluster of Northwest A&F University, multiple sets of experiments were designed to validate the effectiveness of the reconstructed algorithms and the efficiency of the Hadoop platform; curves were drawn from the experimental data and analyzed comprehensively. The results show that, while preserving the effectiveness and accuracy of the algorithms, the MapReduce programming framework improves algorithm efficiency and reduces data processing time, and the HiveQL query language shortens the development cycle and makes it more convenient to handle the various kinds of data stored in the distributed database.
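To make the reconstruction idea in (1) concrete, the sketch below shows how the counting phase of a MapReduce-based Naive Bayes trainer might look in Java on Hadoop: the mapper emits class counts (for the prior) and per-class attribute-value counts (for the conditionals), and the reducer sums them. This is not the thesis's actual code; the input format (comma-separated text records with the class label in the last field), the class names, and the key layout are illustrative assumptions.

// Minimal sketch (assumption: comma-separated records, class label in the last field);
// only the counting phase of a MapReduce Naive Bayes trainer is shown.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveBayesCountJob {

    // Map: one record -> one key for the class and one key per (class, attribute=value).
    public static class CountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length < 2) {
                return; // skip malformed records
            }
            String label = fields[fields.length - 1].trim();

            // Count the class itself (used for the prior P(c)).
            outKey.set("CLASS\t" + label);
            context.write(outKey, ONE);

            // Count each attribute value together with the class
            // (used for the conditional P(attribute_i = v | c)).
            for (int i = 0; i < fields.length - 1; i++) {
                outKey.set(label + "\t" + i + "=" + fields[i].trim());
                context.write(outKey, ONE);
            }
        }
    }

    // Reduce: sum the partial counts for every key.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "naive-bayes-counts");
        job.setJarByClass(NaiveBayesCountJob.class);
        job.setMapperClass(CountMapper.class);
        job.setCombinerClass(SumReducer.class); // local aggregation to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input: training records
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output: count tables
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A second pass (or a local post-processing step) would normalize these counts into prior and conditional probabilities; the K-modes and ECLAT reconstructions follow the same pattern of per-record map-side emission and reduce-side aggregation.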
Keywords/Search Tags: cloud computing, data mining, Hadoop, MapReduce, HBase