Font Size: a A A

Research On Parallel Frequent Itemsets Mining Algorithm

Posted on:2019-05-15Degree:MasterType:Thesis
Country:ChinaCandidate:Z H HeFull Text:PDF
GTID:2428330545476039Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Frequent itemsets mining is used to discover frequent patterns in data item sets.It is widely used in commodity association analysis and supermarket promotion strategy decision.However,the time complexity of traditional frequent item set mining algorithms is high,so many domestic and foreign scholars are committed to improving the performance of related algorithms.With the advent of the era of big data,traditional frequent itemsets mining algorithms are often limited by the limited computing power and storage capacity of a single computer,and cannot meet the urgent needs of users for dealing with larger-scale frequent item set mining problems.With the development of big data technology,frequent item set mining algorithms based on Hadoop platform have greatly improved time efficiency compared to stand-alone algorithms.The latest in-memory computing framework Spark has parallel computing compared to the Hadoop platform.Spark has become the mainstream framework for building distributed computing platforms in the industry.Therefore,this paper combines the Spark framework with frequent item set mining algorithms to study how to implement parallel frequent itemset mining algorithms under the Spark platform to improve the time efficiency of frequent item set mining algorithms.The main work of this article includes the following aspects.(1)Learning the classic frequent item set mining algorithms,including Apriori algorithm,DHP algorithm,FP-Growth algorithm.(2)In the process of generating K+1 frequent itemsets from K frequent itemsets by Apriori algorithm,it is necessary to repeatedly detect whether the two items in the item set are frequent or not,and proposes an Apriori improvement based on two-dimensional tables.The algorithm uses a two-dimensional table to record whether the two subsets are frequent,thereby reducing the time needed to determine if the two subsets frequently require multiple scans of the transaction database.Experimental results show that the improved Apriori algorithm proposed in this paper can significantly reduce the running time of the algorithm compared with the original Apriori algorithm.(3)Learned and studied related technologies of Spark framework.Based on Linux operating system,using Java combined with Scala development language,a distributed development environment based on Spark platform was built to implement the proposed algorithm for parallel frequent itemsets mining.(4)A number of duplicate candidate sets are generated for DHP when the number of itemsets in the first statistics bucket,and a compression DHP algorithm based on Spark single node is proposed.The algorithm replaces the number of duplicate items with the image digital form.,and it is implemented when scanning the transaction database for the first time.It is proved through actual experiments that the proposed compression improvement algorithm is significantly lower in time complexity than the single-node DHP algorithm without compressed DHP.(5)Based on the deficiencies of a single node with only one computing unit,the cluster-based Spark distributed computing framework is studied.The distributed DHP algorithm and distributed FP-Growth algorithm are implemented using the Spark multi-node cluster distributed architecture,which fully utilizes the advantages of the cluster.The experimental results on the simulated data and the UUC data set Pumsb star show that the cluster-based parallel strategy has better time efficiency than the single node-based parallel environment.
Keywords/Search Tags:Spark platform, Association rules, Frequent itemsets, Mining algoriths
PDF Full Text Request
Related items