Parallel Association Rules Algorithm Based On Hadoop

Posted on: 2013-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: L J Chen
Full Text: PDF
GTID: 2298330434975727
Subject: Computer technology
Abstract/Summary:
With the development of computer technology, information technology, and the Internet, there are more and more ways to collect data, and the volume of stored information is growing explosively. Large databases now reach terabytes (TB) or even petabytes (PB). Analyzing such massive data effectively and quickly has become an urgent problem. Traditional serial algorithms can no longer meet the demand for timely processing, so parallel mining algorithms need to be studied; parallel data mining emerged in this context. Within data mining, association rule mining is a very important research area. Hadoop, a cloud computing platform, has attracted growing attention from researchers in recent years, and the combination of the Hadoop platform with association rule algorithms is expected to be a very important research area.

Association rule algorithms have two major bottlenecks: heavy computation and substantial I/O time. With the emergence of Hadoop, both can be eased to some extent. Research on parallel association rule algorithms started early, and in recent years the topic and related issues have been studied extensively both domestically and internationally. Agrawal proposed three parallel Apriori algorithms: CD (count distribution), DD (data distribution), and CaD (candidate distribution). The CD algorithm sends the full candidate set to every computing node; each node scans its local data set to obtain local counts, and the global support counts are obtained by synchronizing these local counts. The DD algorithm partitions the candidate set and sends one part to each node; each node scans the global database to obtain support counts for its local candidates, and the global support counts of all candidates are obtained by exchanging them between nodes.
The CaD algorithm combines the CD and DD algorithms and tries to avoid the synchronization step.

This thesis focuses on these three common parallel versions of the Apriori algorithm and implements them on the MapReduce programming model. A more detailed theoretical comparison of the three is also given. In addition, two optimization strategies are proposed to address shortcomings of the CD algorithm, such as the excessive number of key-value pairs and the large number of iterations, both of which degrade performance. The first strategy buffers counts in memory instead of emitting a key-value pair for every occurrence; the second combines the iterations for low-order itemsets to reduce the total number of iterations. Finally, the thesis introduces two further optimized algorithms, based on data partitioning, for the improved Apriori algorithm.

With these optimization strategies in place, the thesis uses different types of data sets to measure actual performance and compares the results with the original stand-alone algorithm. The experiments show that the optimized versions perform better than the original algorithm on these data sets.
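
The CD scheme and the in-memory buffering strategy described in the abstract can be sketched in plain Python, without Hadoop. This is only an illustration under stated assumptions, not the implementation from the thesis: the function names (`local_counts`, `global_counts`, `cd_pass`) and the toy transactions are hypothetical, each inner list stands in for one node's data split, and the final merge stands in for the reduce/synchronization step.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, candidates):
    """Map side of one CD pass (hypothetical helper).

    Every node receives the same candidate itemsets and scans only its
    own partition. Counts are buffered in an in-memory Counter and
    returned once per partition, rather than emitting a (candidate, 1)
    pair for every match.
    """
    buffer = Counter()
    for transaction in partition:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:  # candidate is a subset of the transaction
                buffer[candidate] += 1
    return buffer


def global_counts(per_node_counts):
    """Reduce side: sum the local counts from all nodes."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


def cd_pass(partitions, candidates, min_support):
    """One count-distribution pass: local scan, then global merge and filter."""
    totals = global_counts(local_counts(p, candidates) for p in partitions)
    return {c: n for c, n in totals.items() if n >= min_support}


if __name__ == "__main__":
    # Toy data (assumed): two "nodes", each holding two transactions.
    partitions = [
        [["bread", "milk"], ["bread", "beer"]],
        [["bread", "milk", "beer"], ["milk"]],
    ]
    # Candidate 2-itemsets over the three items, as frozensets.
    candidates = [frozenset(pair)
                  for pair in combinations(["beer", "bread", "milk"], 2)]
    for itemset, count in cd_pass(partitions, candidates, min_support=2).items():
        print(sorted(itemset), count)
```

Because each partition returns one aggregated counter, the number of intermediate records is bounded by the number of candidates per node instead of by the number of candidate matches, which is the effect the buffering strategy aims for.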
Keywords/Search Tags: Cloud computing, MapReduce, Data Mining, Association rules, Parallel algorithm