Research On Optimization Of Association Rule Apriori Algorithm And Its Parallelization Based On Spark

Posted on: 2017-12-25 | Degree: Master | Type: Thesis
Country: China | Candidate: M J Yan
GTID: 2428330569998604 | Subject: Software engineering
Abstract/Summary:
With the rapid development of computer technology and the widespread use of the Internet, and especially with the maturity and broad adoption of Web 2.0, data volume is growing explosively. Traditional data analysis methods face severe limitations, so data mining has received more and more attention in recent years. Association rule mining is an important branch of data mining, and many algorithms for it have been proposed. The best known are Apriori and its variants, but they still have many deficiencies, such as the generation of candidate itemsets and repeated scanning of the transaction data set.

To address these problems in the Apriori algorithm, and building on existing optimization ideas, this paper presents an optimization strategy with three aspects: a Boolean vector matrix, the elimination of candidate itemset generation, and a BitSet handling mechanism. We propose an optimized algorithm called I-Apriori, which significantly improves running efficiency. In addition, because I-Apriori cannot mine massive data sets efficiently, this paper also parallelizes the I-Apriori algorithm on Spark; the resulting algorithm is called IABS. To increase the degree of parallelism, IABS relies mainly on Spark's parallelism mechanism to use cluster resources.

We evaluate the performance of I-Apriori by comparing its running time with that of the original algorithm under various support thresholds. IABS is also compared with existing algorithms, such as YAFIM. First, by analyzing the characteristics of IABS and YAFIM, we find that IABS takes much more time than YAFIM in the first iteration but less time in later iterations. Second, by evaluating the data scalability and node scalability of IABS and YAFIM, we identify the factors that influence node scalability, such as the number of RDD partitions. In general, both I-Apriori and IABS achieve better performance, which demonstrates the validity of the proposed optimization strategies and parallel processing.
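The abstract does not give the details of I-Apriori, so the following is only a minimal Python sketch of the general Boolean-vector idea it mentions, not the thesis's implementation. It assumes each item is stored as a bit vector over transactions (a Python integer stands in for a BitSet), so the support of an itemset is the number of 1-bits in the AND of its item vectors, and the transaction data is scanned only once. The function names `build_item_vectors`, `support`, and `frequent_itemsets` are hypothetical, and the join step is a plain level-wise union, which may differ from the candidate-elimination strategy the thesis describes.

```python
from itertools import combinations


def build_item_vectors(transactions):
    """Map each item to an int whose bit t is 1 when transaction t contains it."""
    vectors = {}
    for t, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << t)
    return vectors


def support(itemset, vectors):
    """Count transactions containing every item: AND the vectors, count 1-bits."""
    items = iter(itemset)
    acc = vectors[next(items)]
    for item in items:
        acc &= vectors[item]
    return bin(acc).count("1")


def frequent_itemsets(transactions, min_count):
    """Level-wise search; the data set is read once, to build the vectors."""
    vectors = build_item_vectors(transactions)
    level = [frozenset([i]) for i in vectors if support([i], vectors) >= min_count]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset, vectors)
        # Join step (illustrative): unite pairs that differ by exactly one item,
        # then keep only the unions that meet the support threshold.
        size = len(level[0]) + 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = [s for s in joined if support(s, vectors) >= min_count]
    return result
```

Calling `frequent_itemsets` with a list of item sets and an absolute support count returns a dictionary from each frequent itemset (a `frozenset`) to its support count. Because every support query is a bitwise AND over precomputed vectors, later iterations never go back to the raw transactions.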
Keywords/Search Tags:Apriori algorithm, frequent itemset, Spark, RDD, Parallelization