Parallel Association Rules Mining Based On Distributed Framework

Posted on: 2020-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: N Xie
GTID: 2428330623465362
Subject: Software engineering
Abstract/Summary:
Association rule mining is a common method for extracting valuable information from big data. It aims to find frequently occurring items and highly correlated information in the data. Current data volumes far exceed the processing power of single-machine algorithms. At the same time, traditional parallel association rule mining methods suffer from large I/O overhead, poor scalability, low computational efficiency and high consumption of computing resources. To address these problems, parallel association rule mining algorithms based on distributed frameworks are proposed. They build on the Apriori and FP-Growth algorithms and combine data structures such as the Bloom filter and the Hash tree. First, a parallel frequent itemset mining algorithm, P-FIM, is proposed based on the Hadoop MapReduce framework and the Bloom filter. It needs only two MapReduce passes. It improves computing efficiency by reducing the number of MapTasks and by streamlining the transaction sets without generating global candidate sets, which effectively reduces I/O overhead. Second, a dynamic association rule mining algorithm, D-Apriori, is proposed based on the Spark distributed framework, the Bloom filter and the Hash tree, to mine frequent itemsets iteratively. A dynamic adaptive optimization method selects the mining pattern with the higher computational efficiency, so as to maximize computational efficiency. Experiments validate the effectiveness of the two algorithms using several evaluation metrics for parallel algorithms. Compared with four mainstream algorithms under different support thresholds and data sets, the two algorithms show good computational efficiency. In addition, the two algorithms are implemented on Spark and Hadoop, and the effect of each framework on the algorithms is observed. Both frameworks can mine large data sets quickly; Spark gives the larger improvement for the iterative algorithm D-Apriori, and Hadoop is more suitable for the P-FIM algorithm, which has high memory requirements. This thesis has 43 figures, 13 tables and 63 references.
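The abstract does not give the internals of P-FIM or D-Apriori, so the following is only a rough single-machine sketch of the general idea it describes: a first pass finds frequent single items, a Bloom filter records them, and a second pass prunes each transaction with that filter before counting larger itemsets locally. The names `BloomFilter` and `frequent_itemsets`, the parameters `num_bits`, `num_hashes` and `max_size`, and the SHA-256 based hashing are all assumptions made for illustration; they are not taken from the thesis, and no MapReduce or Spark code is shown.

```python
import hashlib
from collections import Counter
from itertools import combinations


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_bits and num_hashes are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets with count >= min_support."""
    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {
        frozenset([item]): count
        for item, count in item_counts.items()
        if count >= min_support
    }

    # Record the frequent items in a compact Bloom filter.
    bloom = BloomFilter()
    for itemset in frequent:
        for item in itemset:
            bloom.add(item)

    # Pass 2: drop items the filter rejects, then count itemsets per
    # transaction. A false positive only costs extra counting; the support
    # check below still removes every infrequent itemset.
    pruned = [sorted(i for i in set(t) if i in bloom) for t in transactions]
    for size in range(2, max_size + 1):
        counts = Counter(
            frozenset(combo) for t in pruned for combo in combinations(t, size)
        )
        frequent.update({s: c for s, c in counts.items() if c >= min_support})
    return frequent


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "milk", "eggs"],
        ["bread", "milk", "butter"],
        ["butter", "eggs"],
    ]
    for itemset, count in frequent_itemsets(baskets, min_support=3).items():
        print(sorted(itemset), count)
```

Pruning transactions before the second pass is what keeps the number of enumerated combinations small. In a distributed setting the filter would be small enough to ship to every worker, though how the thesis actually does this is not stated in the abstract.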
Keywords/Search Tags: Association Rules, Distributed Framework, Data Structure, Apriori, FP-Growth