Parallel Data Mining Algorithm Research In Cloud

Posted on: 2014-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: S J Hu
Full Text: PDF
GTID: 2268330401965127
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer technology, and of network technology in particular, we face rapidly growing volumes of data. How to process this big data in the service of scientific research and production has become a serious problem. Data mining, a discipline that draws on computer science, mathematics and statistics, addresses this problem, and it is a research direction of both practical and academic significance.

Traditional data mining algorithms and their improved variants run on a single node and use serial computing; the Apriori algorithm and the K-means algorithm, for example, both run on a single node. When these algorithms deal with massive data, the limited stand-alone resources (such as [...]) mean they cannot process data mining tasks effectively. To improve data mining on massive data, the best approach is to store the data and run the program on multiple nodes, so that the resources of multiple machines can be combined to complete the data mining task.

To address the defects of the Apriori algorithm and the K-means algorithm, we put forward MC_Apriori and CK. This paper focuses on the following:

(1) It reviews and explains the traditional Apriori and K-means algorithms, analyzes their defects, and studies some existing improved algorithms.

(2) It surveys the history and prospects of cloud computing, analyzes the parallel computing ability of two platforms (Hadoop and Spark) for massive computation, and studies their application in data mining.

(3) To address the repeated full scans of the transaction database and the large candidate sets of the Apriori algorithm, it puts forward the improved algorithm MC_Apriori, which uses a Boolean matrix and transaction weights. MC_Apriori converts the transaction database into a Boolean matrix and turns the calculation of support into vector operations; repeated transactions are compressed by weight.

(4) To address the random initialization of cluster centers and the choice of the K value in the K-means algorithm, it uses Canopy in a modified K-means algorithm, CK. CK first uses the Canopy algorithm to cluster the data quickly into canopies, then runs the K-means algorithm on each canopy.

(5) The two improved algorithms are applied to the Hadoop and Spark platforms and parallelized with cloud computing technology, which improves their applicability in the cloud environment.
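
The idea in item (3) can be illustrated with a minimal single-machine sketch. It is not the thesis's MC_Apriori implementation, which the abstract does not give; the function names (`build_boolean_matrix`, `support`, `frequent_itemsets`) and the candidate-generation step are assumptions made for illustration. Duplicate transactions are collapsed into one column with a weight, each item becomes a Boolean row, and the support of an itemset is the weighted count of columns in which all of its rows are true.

```python
from collections import Counter
from itertools import combinations


def build_boolean_matrix(transactions):
    """Compress duplicate transactions and return (items, rows, weights)."""
    # Identical transactions collapse into one column; the weight records
    # how many times that transaction occurred in the database.
    counts = Counter(frozenset(t) for t in transactions)
    columns = list(counts)
    weights = [counts[c] for c in columns]
    items = sorted(set().union(*columns))
    # rows[item][j] is True when distinct transaction j contains the item.
    rows = {item: [item in col for col in columns] for item in items}
    return items, rows, weights


def support(itemset, rows, weights):
    """Weighted support count: AND the item rows, then sum the weights."""
    vectors = [rows[item] for item in itemset]
    return sum(w for w, bits in zip(weights, zip(*vectors)) if all(bits))


def frequent_itemsets(transactions, min_support):
    """Level-wise search that reads the raw transactions only once."""
    items, rows, weights = build_boolean_matrix(transactions)
    level = [(i,) for i in items if support((i,), rows, weights) >= min_support]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset, rows, weights)
        # Candidate step (illustrative): keep a (k+1)-itemset only if every
        # k-subset was frequent, which is the usual Apriori pruning rule.
        frequent = set(level)
        k = len(level[0]) + 1
        pool = sorted({i for s in level for i in s})
        candidates = [c for c in combinations(pool, k)
                      if all(sub in frequent for sub in combinations(c, k - 1))]
        level = [c for c in candidates if support(c, rows, weights) >= min_support]
    return result
```

For example, `frequent_itemsets([["a", "b"], ["a", "b"], ["b", "c"]], 2)` stores the two identical transactions as a single column of weight 2, so each support count touches two columns instead of three rows.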
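
Item (4) can be sketched in a similar way. The abstract does not give the details of CK, so the code below is only an assumption-laden illustration of one common arrangement: a Canopy pass with a loose threshold `t1` and a tight threshold `t2` produces canopy centers, the number of canopies fixes K, and the centers seed an ordinary K-means run. The names `canopy`, `assign`, `kmeans` and `ck_cluster` are hypothetical, and the first remaining point is taken as each canopy center to keep the example deterministic.

```python
import math


def canopy(points, t1, t2):
    """Return a list of (center, members) canopies; requires t1 > t2 > 0."""
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # deterministic pick, assumed for the sketch
        # Every point within the loose threshold joins this canopy.
        members = [p for p in points if math.dist(p, center) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return canopies


def assign(points, centers):
    """Group points by their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centers, iterations=20):
    """Plain K-means starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its old center instead of dividing by zero.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)


def ck_cluster(points, t1, t2):
    """Seed K-means with canopy centers, so K and the start points are not random."""
    seeds = [center for center, _ in canopy(points, t1, t2)]
    return kmeans(points, seeds)
```

With two well-separated groups of points and thresholds chosen between the within-group and between-group distances, `ck_cluster` finds two canopies and therefore runs K-means with K = 2. Choosing `t1` and `t2` is left to the user in this sketch.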
Keywords/Search Tags: Hadoop, Cloud computing, MapReduce, Spark, Data mining