Font Size: a A A

Research Of Parallel Frequent Itemset Mining Algorithm Based On MapReduce

Posted on:2017-04-18Degree:MasterType:Thesis
Country:ChinaCandidate:C G JiangFull Text:PDF
GTID:2308330488485678Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Association rule has been a important research hotspot in the field of data mining, and it has been widely used in many fields. With the coming of the era of Web2.0, especially the riset of social networking, the accumulation of data showed an exponential growth trend. How to efficiently use association rules on massive data sets is an important task of data mining.In this paper,a new parallel FP-Growth based on PFP algorithm is proposed. The traditional algorithm of Frequent-pattern Growth(FP-Growth for short) is a single machine algorithm based on memory.In face of massive data processing, there will be some problems as short of memory or long execution time, so we need to parallelize the FP-Growth algorithm. The existing parallel FP-Growth algorithm have solved the problems such as the partition of transaction dataset, it can guarantee that each transaction dataset are independent after the partition, but there are still many problems such as too many iterations in the process of FP-tree mining on single node and low efficiency.What’s more, it did not consider the load balancethe when master node partition dataset to the child node. Therefore, how to make the parallel FP-Growth algorithm with high efficiency and load balancing is the problem to be solved in this paper. MapReduce is a distributed parallel programming framework proposed by Google, which allows a computer cluster to process distributed large data sets. This paper designed a new parallel FP-Growth algorithm by using pruning strategy on PFP which is the original algorithm based on the MapReduce.By merging paths witch is not frequent in FP-tree, we can reducing the number of iterations of the part of the branch to improve the mining efficiency. In addition, the load balancing strategy is used when the master node partition dataset to the sub nodes. Firstly, the load of the frequent items is estimated, and then the frequent items in Flist are grouped as Glist by using the load balancing algorithm. Through the combination of these two strategies, this paper designs a new parallel FP-Growth algorithm, and doing the experiments to prove that the algorithm have improved the efficiency when compared with original algorithm.In this paper, the parallel FP-Growth algorithm is applied on the recommendation of the micro-blog friend, and designed of a new algorithm of Micro-blog friends recommendation based on association rules. Previous social network friends recommendation is often based on common friends of a user, but the Micro-blog not only has the properties of social networks, it pay more attention to is the spread of new tings, so the potential friends will not only pay attention to the same people but pay more focus on the same thing (present as the reposting and commenting the micro-blog). This paper combines these two points, using sina micro-blog’s open API to get the attention data between users and micro-blog’s.Here the "attention" as a transaction, the user is regarded as the transaction, the set of all transactions is regarded as a transaction dataset.By useing parallel FP growth algorithmon on this dataset, then frequent 2-itemsets are extracted. Rank the users according to the frequency from high to low and get the first N users as the friends of recommendation.By doing the experiments we find that the algorithm which porposted in this paper performed better in accuracy rate and recall rate when compered with Friend-of-friend algorithm which is based on common friends.
Keywords/Search Tags:association rule, parallel computing, FP-Growth, MapReduce, friend recommendation
PDF Full Text Request
Related items