Font Size: a A A

Parallel Frequent Itemset Mining Based On MapReduce

Posted on:2013-09-25Degree:MasterType:Thesis
Country:ChinaCandidate:P J XieFull Text:PDF
GTID:2298330434475679Subject:Computer technology
Abstract/Summary:PDF Full Text Request
The era of "big data" is coming. All kinds of data including web logs, social network data, records of telephone communications, meteorological records etc. are growing explosively. Digging out valuable information from massive data by means of data mining technology has been a hot spot in the research of information technique. Frequent itemset mining(FIM),as the main step in association rules mining,plays an essential role in mining correlations,causality,sequential patterns and many other significant data mining tasks.FIM can generate the frequent patterns from small data effectively. Unfortunately, as the volume of dataset gets larger day by day, the FIM algorithm including Apriori-like and FPGrowth-like algorithm become ineffective due to too huge resource requirement, parallel computing is a very effective solution.The traditional parallel computing technology includes OpenMP、MPI and so on. MapReduce is a distributed computing framework proposed by Google.Compared with traditional model of parallel computing, it can complete the parallel processing of computational tasks,partition the data and tasks automatically.It takes care of the details of data storage,data communications,fault-tolerant.so,it has the advantages of good scalability,simple programming,good fault tolerance.In this paper, we summarizes the parallel methods of FPGrowth, analyses the parallel FPGrowth algorithm based on MapReduce (we call it PFP).PFP projects the data to different nodes according Glist, then runs the local FPGrowth on each node independently. In such a way PFP breaks the bottleneck of memory and computing capability.Then,we improve the grouping strategy in PFP,propose a balanced parallel FPGrowth algorithm(we call it balanced PFP).At the end,we implement the PFP and balanced PFP algorithms in Hadoop platform,compare them on the availability and performance.The result of experiment show that both PFP algorithms have good parallel performance when the dataset is huge;Especially when the local FPGrowth takes most of the time of the program,banlanced PFP algorithm can get better performance.
Keywords/Search Tags:Big Data, Association Rules, Frequent Itemset Mining, Parallel Computing, MapReduce, Parallel FPGrowth, Load Balancing
PDF Full Text Request
Related items