
An Improved Algorithm Of Association Rules Based On The Spark

Posted on: 2018-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: L Ye
GTID: 2348330536968010
Subject: Circuits and Systems
Abstract/Summary:
In the big data era, increasingly abundant data and information are available, and finding unexpected connections between transactions through data integration and scientific analysis is of great significance. Extracting useful information from large datasets is one of the most important research problems, and association rule mining is one of the most effective methods for this purpose. Conventional approaches to mining frequent itemsets face significant challenges when computing power and memory are limited. This thesis proposes two distributed frequent itemset mining algorithms for big data analytics using Spark. Cloudera Manager 5 and CDH 5 are installed offline, and the algorithms' efficiency, cluster behavior, and flexibility are then verified. The specific research contents are as follows:

(1) The Spark + IApriori algorithm is proposed. When faced with the huge volumes of data of the information explosion era, the Apriori association rule algorithm suffers from long computation cycles and low efficiency. To address this, the thesis stores data in key-value structures, performs a pruning operation before the itemsets are self-joined, and changes the judgment condition to reduce the number of passes over the data. Combining the improved algorithm with the Spark computing framework yields Spark + IApriori. Experimental results show that Spark + IApriori has better data scalability and a better speedup ratio than Apriori.

(2) The SIFP algorithm is proposed. The FP-growth algorithm improves mining efficiency through a simple structure, the FP-tree, but the FP-tree's memory overhead is large, and the algorithm's limitations become more prominent when mining large-scale data. To address these inherent defects, a HashMap structure that implements key-value storage and a Flag variable that indicates whether the FP-tree is a single path are added to the header table. The data is then divided into blocks to prevent the FP-tree from growing too large. On this basis, the thesis presents a distributed SIFP algorithm based on the Spark framework that improves FP-growth in both the header table and the FP-tree. Test results show that the SIFP algorithm is faster than the Spark + IApriori algorithm.
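
The Apriori-side ideas mentioned in (1), key-value support counts and pruning before the self-join, can be illustrated with a minimal single-machine sketch. This is not the thesis's IApriori implementation: the function name `frequent_itemsets`, the specific pre-join pruning rule (an item must occur in at least k-1 frequent (k-1)-itemsets to appear in any frequent k-itemset), and the use of plain Python dictionaries are all assumptions made for illustration. In a Spark setting, the counting step would presumably be distributed across the cluster, for example by emitting (itemset, 1) pairs and summing them by key.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise frequent itemset mining (illustrative sketch only).

    Returns a dict mapping each frequent itemset (a frozenset) to its
    absolute support count, i.e. a key-value store of itemset -> count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items as key-value pairs.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Assumed pre-join pruning rule: every item of a frequent k-itemset
        # occurs in at least k-1 of its frequent (k-1)-subsets, so itemsets
        # containing a rarer item cannot contribute to any candidate.
        occurrences = defaultdict(int)
        for s in frequent:
            for item in s:
                occurrences[item] += 1
        survivors = [
            s for s in frequent
            if all(occurrences[item] >= k - 1 for item in s)
        ]

        # Self-join the surviving (k-1)-itemsets into k-item candidates,
        # keeping only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # One pass over the data per level to count candidate support.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1

    return all_frequent
```

The pre-join filter shrinks the input to the self-join, which is where candidate generation cost concentrates. Whether the thesis uses this exact rule is not stated in the abstract.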
Keywords/Search Tags: Association rules, Apriori, MapReduce, Hadoop, Spark