Font Size: a A A

Parallelizable Algorithms Research Of Association Rules Mining

Posted on:2017-01-14Degree:MasterType:Thesis
Country:ChinaCandidate:Z Y LiuFull Text:PDF
GTID:2348330491963102Subject:Computer technology
Abstract/Summary:PDF Full Text Request
Every walk of life has accumulating a large amount of data with the continuous development of information technology. To transform the data into useful knowledge, the people invent the data mining technology. But traditional serial algorithms of data mining technology have a bad efficiency in the face of mass data, parallelizing technique has a high speed development and widespread applications recently, it can effectively enhance the efficiency of algorithm in the face of mass data. So it has become a consensus using parallelizing technique to enhance the efficiency of data mining algorithm.Association rules mining studies the relation of useful items which is an important branch of data mining technology. In general, the people regard the association rules mining algorithms as frequent items mining algorithms because frequent items mining is the most important step of the association rules mining. There are mainly three types of association rules mining basic algorithm:multiple candidate algorithms (Apriori, partition, sampling et al), FP-Growth algorithm (FP-growth, HMine, FPMax, Close+et al) and vertical format algorithms (Eclat, CHARM et al). There are mainly three types of parallelizing technique: CPU-parallel, GPU-parallel and node-parallel(cluster). So the purpose of paper is combineing some association rules mining with some parallelizing technique. The paper discusses the parallel association rules mining algorithms on above platform and introduce the GPU-CUDA parallel computation framework and MapReduce-Spark parallel computation framework in detail based on which the paper introduces two important algorithms.FP-growth algorithm is a kind of frequent items mining algorithm based on memory. But it is not realistic to construct the global frequent pattern tree which based on main memory in the face of mass data or low support. FP-growth algorithm extension scheme split big data set to many small dataset and use FP-growth algorithm mine small dataset to solve the problem. The paper uses the idea of parallel projection to introduce a kind of simple grouping algorithm based on the FP-growth algorithm extension scheme. Base on the considering load balance between nodes, the paper improves the simple grouping algorithm to balance grouping algorithm. the Spark-FP-growth parallel algorithm use above grouping algorithm to split the big dataset and parallelly execution the FP-growth algorithm on every small dataset.the paper also imports the idea of topk aggregation which improve FP-growth algorithm to Topk-FP-growth algorithm. In turn, the parallel algorithm has a good availability and speed performance.the paper also studies the parallelization algorithm on the Spark-GPU platform and intoduce a Spark-GPU-Apriori algorithm.the algorithm still use the above grouping model and ivoking the GPU-CUDA apriori algorithm according to Spark RDD pipe interface.the algorithm cover a variety of parallel level and organically combine Spark and GPU parallel computation framework.Make the Spark-Apriori algorithm and Spark-mblib-FP-growth algorithm as benchmark test algorithm. The paper tests respectively the speed performance and the Scalability performance of Spark-SPFP-growth algorithm, Spark-BPFP-growth algorithm and Spark-GPU-Apriori algorithm. The experiment results show that Spark-FP-growth algorithm has a good speed performance and scalability performance compared to Spark-Apriori algorithm and Spark-mblib-FP-growth algorithm; Spark-GPU-Apriori algorithm has a good speed performance and scalability performance compared to Spark-Apriori algorithm, but worse than Spark-mblib-FP-growth algorithm.
Keywords/Search Tags:Association rules, FP-growth, Spark, GPU, CUDA
PDF Full Text Request
Related items