Font Size: a A A

Research On Parallel Frequent Itemset Mining Algorithm Based On MapReduce

Posted on:2022-04-04Degree:MasterType:Thesis
Country:ChinaCandidate:C ZhangFull Text:PDF
GTID:2518306524498604Subject:Master of Engineering
Abstract/Summary:PDF Full Text Request
With the rapid development of information technology,a large number of data that increasingly shows the features of huge number,complex structure,numerous types,valuable and so on,is generated and accumulated.Exploring useful knowledge and mining valuable information in the data have always been the main work of traditional data mining.Moreover,the main focus is to analyze and discover the association rule mining technology between items in the data set is a vital branch of data mining.Faced with the massive amounts of data,it is no longer possible to meet people's needs for analysing and processing big data by simply improving computer hardware.Therefore,parallelization technology has received more and more attention and research.One of the main directions of current research is to improve the traditional association rule mining algorithm and combine it with the distributed computing model.At present,most of the parallel frequent itemset mining algorithms that has been proposed are the migration of Aproiori,FP-Growth and Eclat algorithms to the Map Reduce computing model to achieve parallelization,which is belong to single parallel algorithm.Although it has made great achievements in big data mining,it has inherent defects.In view of this,this article starts with the research of the hybrid parallel frequent itemset mining algorithm,and improves the mining efficiency of the algorithm by improving the existing MRPrePost algorithms.The main work of this paper is as follows:Aiming at the problem of unbalanced load,bad efficiency of N-list merge and redundant search for each node based on parallel frequent itemset mining algorithm MRPrePost,this paper proposes a hybrid parallel frequent itemset mining algorithm based on N-list,named HP-FIMBN.Firstly,a load estimation function(LE)is designed to calculate the load of each item in F-list,and a grouping method based on greedy strategy(GM-GS)is proposed,which not only dealt with the problem of unbalanced load in the process of data partitioning,but also decreased the size of sub-PPC-tree on local node;Secondly,in order to accelerate the completion of the merging of two N-list structures,it designs an early abandon strategy(EAS),which can not only efficiently avoid invalid calculations in the merging process,but also do not need to traverse the initial N-list structure to get the final result,so as to improve the mining efficiency of the algorithm.Finally,the set-enumeration tree is adopted as the search space,and a superset equivalent strategy(SES)is proposed to avoid redundant search in the mining process and generate the final mining results.The experimental results show that the modified algorithm has better performance on mining frequent itemsets in a big data environment.In view of the problem of excessive time and space complexity and unbalanced load for each node based on the parallel frequent itemset mining algorithm MRPrePost,this paper proposes an optimization parallel frequent itemset mining algorithm based on Map Reduce,named PFIMD.Firstly,a data structure called DiffNodeset is adopted,which effectively avoid the defect that the N-list cardinality gets very large in the MRPrePost algorithm.Secondly,2-way comparison strategy is designed to speed up the DiffNodeset generation of 2-itemsets and reduce the time complexity of the algorithm.Finally,the steps of improved algorithm are parallelized by using the cloud computing platform named Hadoop and the programming model named Map Reduce.Otherwise,in order to achieve a uniform grouping of each item in F-list,a load balancing strategy based on dynamic grouping is proposed,which solves the problem of uneven load of each node in the cluster.The experimental results show that the modified algorithm not only overcomes the shortcoming of MRPrePost in the big data environment,but also greatly reduces the time and space complexity.
Keywords/Search Tags:association rules, frequent itemset, N-list, DiffNodeset, parallelization
PDF Full Text Request
Related items