Parallel Data Mining Algorithm Research In Cloud

Posted on:2014-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:S J Hu

Full Text:PDF

GTID:2268330401965127

Subject:Computer application technology

Abstract/Summary:

Computer technology, and network technology in particular, has developed rapidly, and we now face rapidly growing volumes of data. How to process big data so that it serves scientific research and production has become a serious problem. To solve this problem, we use data mining, a discipline that draws on computer science, mathematics, and statistics. Data mining is a research direction of both practical and academic significance.

Traditional data mining algorithms and their improved variants run on a single node and use serial computing; the Apriori and K-means algorithms, for example, run on a single node. When these algorithms deal with massive data, the limited resources of a stand-alone machine (such as ...) prevent them from processing data mining tasks effectively. To improve data mining algorithms on massive data, the best way is to store the data and run the program on multiple nodes, so that the resources of multiple machines can be used together to complete the data mining task. To address the defects of the Apriori and K-means algorithms, we put forward MC_Apriori and CK. This thesis focuses on the following:

(1) It explains the traditional Apriori and K-means algorithms, analyzes their defects, and studies some existing improved algorithms.

(2) It reviews the history and prospects of cloud computing, analyzes the parallel computing ability of two platforms (Hadoop and Spark) on massive computations, and studies their application to data mining.

(3) To address the repeated full scans of the transaction database and the large candidate sets of the Apriori algorithm, this thesis puts forward the improved algorithm MC_Apriori, which uses a Boolean matrix and transaction weights. MC_Apriori converts the transaction database into a Boolean matrix and turns the calculation of support into vector operations; repeated transactions are compressed by weight.

(4) To address the random initialization of cluster centers and the choice of the K value in the K-means algorithm, the modified K-means algorithm in this thesis uses Canopy. CK first uses the Canopy algorithm to quickly cluster the data into Canopies, and then runs the K-means algorithm on each Canopy.

(5) The two improved algorithms are applied to the Hadoop and Spark platforms and parallelized with cloud computing technology, which improves their applicability in the cloud environment.
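
The Boolean-matrix idea in item (3) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the MC_Apriori implementation from the thesis: the function names `build_boolean_matrix` and `support`, the column-per-item layout, and the sample transactions are all hypothetical. It shows how duplicate transactions might be compressed into weighted rows, and how a support count then becomes an AND over item columns followed by a weighted sum, with no rescan of the raw transaction database.

```python
from collections import Counter


def build_boolean_matrix(transactions):
    """Compress duplicate transactions and encode items as Boolean columns.

    Returns (columns, weights). columns[item][k] is True when distinct
    transaction k contains the item; weights[k] is how many times that
    transaction occurred in the input. (Hypothetical layout for illustration.)
    """
    # Identical transactions collapse into one row; the count is its weight.
    counts = Counter(frozenset(t) for t in transactions)
    rows = list(counts)
    weights = [counts[row] for row in rows]
    items = sorted({item for row in rows for item in row})
    columns = {item: [item in row for row in rows] for item in items}
    return columns, weights


def support(itemset, columns, weights):
    """Weighted support count of an itemset.

    A row supports the itemset when every item column is True for that row
    (a logical AND across columns); its weight is then added to the total.
    """
    total = 0
    for k, weight in enumerate(weights):
        if all(columns[item][k] for item in itemset):
            total += weight
    return total


if __name__ == "__main__":
    # Made-up sample data: five transactions, one of them repeated.
    transactions = [
        ["bread", "milk"],
        ["bread", "milk"],
        ["bread", "beer"],
        ["milk"],
        ["bread", "milk", "beer"],
    ]
    columns, weights = build_boolean_matrix(transactions)
    print(len(weights))                                   # 4 distinct rows
    print(support(("bread", "milk"), columns, weights))   # 3
```

In this sketch the five sample transactions compress to four weighted rows, and the matrix is built in a single pass over the data. A distributed version on Hadoop or Spark would need to partition the rows across nodes and sum the partial support counts, which is not shown here.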

Keywords/Search Tags:

Hadoop, Cloud computing, MapReduce, Spark, Data mining

Related items

1	The Process And Research Of Massive Data Mining Based On Cloud Computing
2	Data Mining Based On Hadoop Platform
3	Research On Massive Digital Image Data Mining Based On Hadoop Cloud Platform
4	Study On Parallel Algorithm Of Large-scale Numerical Calculation In Cloud Computing Environment
5	Pattern Mining Algorithm On Cloud Computing Platform
6	The Research Of MapReduce Job Scheduling Algorithm Based On The Hadoop Platform
7	Based On The Parallel Implementation Of Multi-node Data Mining Algorithm
8	Research Of Massive Data Processing And Mining In Database Marketing Based On Hadoop
9	Parallel Algorithms Research Based On Hadoop And Hama
10	Research And Design Of Data Mining System For TCM Disease Based On Cloud Computing Environment