Parallel Frequent Itemset Mining Based On MapReduce

Posted on:2013-09-25

Degree:Master

Type:Thesis

Country:China

Candidate:P J Xie

Full Text:PDF

GTID:2298330434475679

Subject:Computer technology

Abstract/Summary:

The era of "big data" has arrived. Data of all kinds, including web logs, social network data, telephone call records, and meteorological records, are growing explosively. Extracting valuable information from massive data by means of data mining has become a hot spot in information technology research. Frequent itemset mining (FIM), the main step in association rule mining, plays an essential role in mining correlations, causality, sequential patterns, and many other significant data mining tasks. FIM can generate frequent patterns from small datasets effectively. Unfortunately, as datasets grow larger day by day, FIM algorithms, including Apriori-like and FPGrowth-like algorithms, become ineffective because their resource requirements are too high. Parallel computing is a very effective solution. Traditional parallel computing technologies include OpenMP, MPI, and so on. MapReduce is a distributed computing framework proposed by Google. Compared with the traditional models of parallel computing, it partitions the data and tasks automatically to process computational tasks in parallel, and it takes care of the details of data storage, data communication, and fault tolerance. It therefore has the advantages of good scalability, simple programming, and good fault tolerance.
This thesis summarizes the parallel methods for FPGrowth and analyzes the parallel FPGrowth algorithm based on MapReduce (which we call PFP). PFP projects the data onto different nodes according to the group list (G-list) and then runs a local FPGrowth on each node independently. In this way, PFP breaks the bottleneck of memory and computing capability. We then improve the grouping strategy in PFP and propose a balanced parallel FPGrowth algorithm (which we call balanced PFP). Finally, we implement the PFP and balanced PFP algorithms on the Hadoop platform and compare them in terms of availability and performance. The experimental results show that both PFP algorithms have good parallel performance when the dataset is huge; in particular, when the local FPGrowth accounts for most of the program's running time, the balanced PFP algorithm achieves better performance.
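
The projection step described above can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's Hadoop implementation: the function names (build_flist, contiguous_groups, balanced_groups, project) are hypothetical, and "F-list" here simply means the frequent items sorted by descending support. The abstract does not state how the balanced grouping works, so balanced_groups uses a greedy "heaviest item to the lightest group" heuristic with an assumed cost of support times rank, purely as one plausible way to even out the load.

```python
from collections import Counter, defaultdict
import heapq


def build_flist(transactions, min_support):
    """Return frequent items as (item, support), most frequent first."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(i, c) for i, c in counts.items() if c >= min_support]
    # Descending support; ties broken by item name so the order is stable.
    frequent.sort(key=lambda p: (-p[1], p[0]))
    return frequent


def contiguous_groups(flist, num_groups):
    """Baseline G-list: cut the F-list into contiguous slices."""
    size = -(-len(flist) // num_groups)  # ceiling division
    return {item: rank // size for rank, (item, _) in enumerate(flist)}


def balanced_groups(flist, num_groups,
                    cost=lambda rank, support: support * (rank + 1)):
    """Assumed heuristic: hand the costliest item to the lightest group."""
    heap = [(0, g) for g in range(num_groups)]  # (current load, group id)
    heapq.heapify(heap)
    weighted = sorted(
        ((cost(r, s), item) for r, (item, s) in enumerate(flist)),
        reverse=True,
    )
    assignment = {}
    for c, item in weighted:
        load, g = heapq.heappop(heap)
        assignment[item] = g
        heapq.heappush(heap, (load + c, g))
    return assignment


def project(transactions, flist, item_to_group):
    """Map step: emit one group-dependent prefix per group per transaction."""
    rank = {item: r for r, (item, _) in enumerate(flist)}
    shards = defaultdict(list)
    for t in transactions:
        # Drop infrequent items and order the rest as in the F-list.
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        emitted = set()
        # Scan from the least frequent item; each group receives the
        # longest prefix that ends in one of its own items.
        for j in range(len(ordered) - 1, -1, -1):
            g = item_to_group[ordered[j]]
            if g not in emitted:
                emitted.add(g)
                shards[g].append(ordered[: j + 1])
    return dict(shards)
```

Each shard returned by project contains every transaction prefix needed to count the itemsets whose least frequent item belongs to that group, which is why a local FPGrowth can run on each shard independently. Passing the result of balanced_groups instead of contiguous_groups changes only which items share a shard, not the projection logic.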

Keywords/Search Tags:

Big Data, Association Rules, Frequent Itemset Mining, Parallel Computing, MapReduce, Parallel FPGrowth, Load Balancing

Related items

1	Research On Parallel Frequent Itemset Mining Algorithm Based On MapReduce
2	Research Of Parallel Frequent Itemset Mining Algorithm Based On MapReduce
3	Research On Parallelization And Load Balancing Of Frequent Pattern Mining Algorithm Based On MapReduce
4	Association Rules Research And Analysis Of Parallel Algorithms
5	Data-Mining Methods Study And Its Application In Traditional Chinese Prescription Compatibility Analysis
6	Research On Frequent Itemset Mining Algorithm And Its Parallelization Based On Spark
7	The Research And Application Of Association Rules Mining Algorithms Based On Directed Itemset Graph
8	Mining Association Rules For Alarm Data In Business Supporting Network Based On Parallel FPGrowth Algorithm
9	Research For Association Rules Algorithm On Big Data
10	Research And Application Of PVI Algorithm On Spatial Data Mining