An Improved Algorithm Of Association Rules Based On The Spark

Posted on:2018-11-30

Degree:Master

Type:Thesis

Country:China

Candidate:L Ye

GTID:2348330536968010

Subject:Circuits and Systems

Abstract/Summary:

The big data era makes increasingly abundant data and information available, and finding unexpected connections between transactions through data integration and scientific analysis is of great significance. Extracting useful information from large datasets is one of the most important research problems, and association rule mining is one of the best methods for this purpose. Conventional approaches to mining frequent item sets run into significant difficulties on big data when computing power and memory space are limited. This thesis proposes two distributed frequent item set mining algorithms that use Spark for big data analytics. Cloudera Manager 5 and CDH 5 were installed offline, and the algorithms' efficiency, cluster behavior, and flexibility were then verified. The specific research contents are as follows:

(1) The Spark + IApriori algorithm is proposed. Faced with the huge data volumes of the information explosion era, the Apriori association rule algorithm suffers from long computation cycles and low efficiency. The thesis addresses this by storing data in key-value structures, pruning before the item set self-join, and changing the judgment condition to reduce the number of passes over the data. Combining these changes with the Spark computing framework yields an improved Spark-based algorithm (Spark + IApriori). Experimental results show that Spark + IApriori has better data scalability and speedup than Apriori.

(2) The SIFP algorithm is proposed. The FP-growth algorithm improves mining efficiency through a simple structure, the FP-tree, but the FP-tree has a large memory overhead, and the limitations of FP-growth become more prominent when mining large-scale data. To address these inherent defects, a hash map is used to implement key-value storage, and a flag variable indicating whether the FP-tree is a single path is added to the header table. The data is then divided into blocks to keep the FP-tree from overflowing memory. The result is a distributed SIFP algorithm based on the Spark framework and on an FP-growth algorithm improved in both its header table and its FP-tree. Test results show that the SIFP algorithm is faster than the Spark + IApriori algorithm.
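The abstract does not give the details of IApriori, so the following is only a minimal single-machine Python sketch of the general ideas it names: support counts held in a key-value map, and pruning applied before the self-join. It is not the thesis's Spark implementation. The function name `frequent_itemsets`, the sample `baskets`, and the specific pruning rule (dropping items and transactions that can no longer contribute to a k-item set) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset: support count} for every frequent item set.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items in a key-value map (item set -> count).
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Assumed pruning rule, applied before the self-join: keep only items
        # that still occur in a frequent (k-1)-item set, then drop
        # transactions too short to contain any k-item set.
        live_items = frozenset().union(*current)
        transactions = [t & live_items for t in transactions]
        transactions = [t for t in transactions if len(t) >= k]

        # Self-join: merge pairs of frequent (k-1)-item sets into k-item
        # candidates, keeping a candidate only if all of its (k-1)-subsets
        # are frequent (the classic Apriori property).
        prev = set(current)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count candidate support over the reduced transaction list.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


# Made-up sample data, not from the thesis.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
found = frequent_itemsets(baskets, min_support=3)
```

In a Spark setting the per-candidate counting step would plausibly map onto a key-value aggregation across partitions, but how the thesis distributes the work is not stated in the abstract.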

Keywords/Search Tags:

Association rules, Apriori, MapReduce, Hadoop, Spark

Related items

1	Research On The Apriori Algorithms For Meteorological Data Association Rules Analysis Based On Cloud Computing
2	Research On Association Rules Algorithm Based On Hadoop
3	Mining Association Rules Algorithm Analysis Based On Hadoop
4	The Study On The Recommending Methods For Online Travel Websites Association Rules
5	Research On Apriori Algorithms Based On Distributed Platform
6	Research And Application Of Association Rule Algorithm Based On Spark Platform
7	The Research Of Quantitative Association Rules Data Mining Based On Hadoop
8	Research And Application Of Association Rules Algorithm Based On MapReduce
9	Research On Improvement Of Apriori Algorithm Based On Hadoop Platform
10	The Study Of The Improvement And Transplantation Of Apriori Algorithm Based On Hadoop