Parallel Association Rules Algorithm Based On Hadoop

Posted on: 2012-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: C L Yu
Full Text: PDF
GTID: 2178330338951832
Subject: Computer software and theory
Abstract/Summary:
Association rule mining is an important research direction in data mining. It operates on large databases, so both the computation and the I/O volume are very large; the data can reach terabytes or even petabytes. Serial algorithms cannot process datasets of this size in acceptable time, so research on parallel algorithms is necessary.

Parallel computing has traditionally been implemented with MPI (Message Passing Interface). MPI implementations cannot handle node failure, which is hard to avoid in clusters built from ordinary computers. The MapReduce framework, introduced by Google in 2004, can handle node failure and is a major piece of cloud computing infrastructure. MapReduce divides the data into many blocks and starts multiple map tasks that compute on them in parallel. Hadoop is an open-source implementation of the MapReduce framework.

In this thesis we present a new parallel association rules algorithm based on Hadoop, developed from the CD (Count Distribution) algorithm. The main improvement is that the candidate sets of frequent itemsets are generated only once, by the main process, and the frequency statistics of the candidate sets are likewise computed only once, by the main process.

To evaluate the algorithm, we wrote a Hadoop-based parallel association rules program for mining a dataset, built a basic Hadoop platform, and measured performance while varying the system configuration, the map capacity, and the dataset size. The results show that the Hadoop-based parallel association rules algorithm has an advantage on large-scale datasets. On small datasets, deploying tasks across the cluster takes time on every run and wastes considerable computing resources, so the algorithm is not suited to small-scale datasets. Because the Hadoop platform itself handles node failure, the Hadoop-based algorithm tolerates node failure as well. Monitoring the output of the test runs shows that the algorithm performs dynamic load balancing.

Theory and experiment together show that the parallel association rules algorithm based on Hadoop can handle node failure, performs dynamic load balancing, and is suited to mining association rules from large-scale datasets.
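The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general Count Distribution idea it builds on, simulated in plain Python rather than on Hadoop. All function names (`map_count`, `reduce_counts`, `generate_candidates`, `mine_frequent_itemsets`) and the in-memory "splits" are assumptions made for illustration, not the thesis's implementation. The sketch reflects one reading of the abstract: a driver generates each level's candidate set once, every data split is counted independently (the map step), and the partial counts are summed once per level (the reduce step). It stops at frequent itemsets and does not derive the rules themselves.

```python
from collections import Counter
from itertools import combinations


def map_count(split, candidates):
    """Map step (assumed): count candidate itemsets within one data split."""
    counts = Counter()
    for transaction in split:
        items = set(transaction)
        for cand in candidates:
            if cand <= items:  # candidate is a subset of the transaction
                counts[cand] += 1
    return counts


def reduce_counts(partials, min_support):
    """Reduce step (assumed): sum partial counts, keep itemsets with enough support."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {cand: n for cand, n in total.items() if n >= min_support}


def generate_candidates(frequent, k):
    """Driver step: build k-itemsets whose (k-1)-subsets are all frequent."""
    items = sorted({i for s in frequent for i in s})
    return [
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
    ]


def mine_frequent_itemsets(splits, min_support):
    """Level-wise loop: one candidate generation and one map/reduce round per level."""
    # Level 1 candidates: every distinct item seen in any split.
    candidates = list({frozenset([i]) for split in splits for t in split for i in t})
    result, k = {}, 1
    while candidates:
        # On a real cluster each split would be processed by a separate map task.
        partials = [map_count(split, candidates) for split in splits]
        frequent = reduce_counts(partials, min_support)
        result.update(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)
    return result
```

Calling `mine_frequent_itemsets(splits, min_support)`, where each split is a list of transactions and each transaction a list of items, returns a dictionary mapping every frequent itemset to its global count. Because each split is counted independently, a failed map task could be rerun on its split alone, which is the property the abstract attributes to the Hadoop platform.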
Keywords/Search Tags: Hadoop, Cloud computing, Parallel, Association rules