Parallel Association Rules Algorithm Based On Hadoop

Posted on:2012-09-06

Degree:Master

Type:Thesis

Country:China

Candidate:C L Yu

Full Text:PDF

GTID:2178330338951832

Subject:Computer software and theory

Abstract/Summary:

Association rule mining is an important research direction in data mining. It operates on large databases, so both the computation and the I/O volume are very large; the data often reaches the TB or even PB scale. Serial algorithms cannot process such datasets in acceptable time, so parallel algorithms need to be studied. Parallel computing has traditionally been implemented with MPI (Message Passing Interface). An MPI implementation, however, cannot handle node failure, and node failure is hard to avoid in a cluster built from ordinary computers. Google's MapReduce framework, introduced in 2004, can deal with node failure. MapReduce is a major piece of cloud computing infrastructure: it divides the data into many blocks and starts multiple map computations in parallel. Hadoop is an open-source implementation of the MapReduce framework.

This thesis proposes a new parallel association rules algorithm based on Hadoop, developed from the CD (Count Distribution) algorithm. The main improvement is that the candidate sets of frequent itemsets are computed only once, by the main process, and the frequency statistics of the candidate sets are likewise computed only once, by the main process.

To evaluate the performance of the algorithm, we wrote a Hadoop-based parallel association rules program to mine a dataset and built a basic Hadoop platform. Performance was evaluated by varying the system configuration (map capacity) and the dataset size. The results show that the Hadoop-based parallel association rules algorithm has an advantage on large-scale datasets. On small datasets, deploying the tasks across the cluster takes a certain amount of time for each computation and wastes considerable computing resources, so the algorithm is not suitable for small-scale datasets. Because the Hadoop platform itself handles node failure, the algorithm tolerates node failure. The output monitored during the test runs shows that the algorithm achieves dynamic load balancing. Theory and experiment show that the parallel association rules algorithm based on Hadoop can handle node failure, performs dynamic load balancing, and is suited to mining association rules from large-scale datasets.
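The counting scheme summarized in the abstract can be illustrated with a small sketch. This is not the thesis's code: it assumes a plain level-wise Apriori search, simulates the map and reduce steps inside a single Python process instead of calling any Hadoop API, and every function name (`generate_candidates`, `map_count`, `reduce_counts`, `mine_frequent_itemsets`) is hypothetical. It only shows the general idea of generating candidates once in a driver, counting them per data block, and summing the counts once.

```python
from collections import Counter
from itertools import combinations


def generate_candidates(frequent, k):
    """Driver side: build size-k candidates once from the frequent (k-1)-itemsets."""
    frequent = set(frequent)
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            # Keep the union only if it has k items and all (k-1)-subsets are frequent.
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def map_count(block, candidates):
    """Map step: count how often each candidate occurs in one block of transactions."""
    local = Counter()
    for transaction in block:
        items = set(transaction)
        for cand in candidates:
            if cand <= items:
                local[cand] += 1
    return local


def reduce_counts(partials):
    """Reduce step: sum the per-block counts a single time."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def mine_frequent_itemsets(blocks, min_count):
    """Level-wise loop run by the driver; returns {itemset: support count}."""
    # Level 1 candidates: every single item that appears in the data.
    candidates = {frozenset([item]) for block in blocks for t in block for item in t}
    result = {}
    k = 1
    while candidates:
        # On a real cluster these map calls would run in parallel, one per block.
        partials = [map_count(block, candidates) for block in blocks]
        totals = reduce_counts(partials)
        frequent = {c: n for c, n in totals.items() if n >= min_count}
        result.update(frequent)
        k += 1
        candidates = generate_candidates(frequent.keys(), k)
    return result


# Two blocks stand in for two input splits of a made-up transaction dataset.
blocks = [
    [["bread", "milk"], ["bread", "beer", "eggs"]],
    [["milk", "beer", "bread"], ["bread", "milk", "beer"], ["milk", "eggs"]],
]
frequent = mine_frequent_itemsets(blocks, min_count=3)
```

In this toy dataset, `frequent` contains three single items and two pairs; `{milk, beer}` is dropped because it appears in only two transactions. The per-block counting is the part that a MapReduce job would distribute, while candidate generation and the final summation each happen in one place.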

Keywords/Search Tags:

Hadoop, Cloud computing, Parallel, Association rules

Related items

1	Parallel Association Rules Algorithm Based On Hadoop
2	Research On Parallel Association Rules Algorithm Based On HADOOP Platform
3	The Research And Implementation Of Parallel Association Rules Algorithm Based On Cloud Environment Data Mining
4	Research On Association Rules Algorithm Based On Hadoop
5	The Parallel Association Rules Algorithm Based On Mapreduce In The Application Of Community Analysis Research
6	Research On Parallel Association Rule Mining Algorithm Based On Hadoop Platform
7	Research For Association Rules Algorithm On Big Data
8	Research Of Parallel Association Rules Algorithm Based On Hadoop
9	Parallel Association Rules Algorithm Based On Hadoop Platform
10	Research On Association Rules Mining Methods Of Mass Engineering Data Based On Hadoop