The Parallelization And Optimization Of Fp-Growth Algorithm Based On Spark

Posted on:2016-12-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y G Fu

Full Text:PDF

GTID:2348330479954725

Subject:Computer technology

Abstract/Summary:


With the development of computer technology, and network technology in particular, the volume of data that computers must process has grown rapidly. Traditional data mining algorithms, however, compute serially and run on a single node. When dealing with vast amounts of data, limited hardware resources make these algorithms inefficient. To improve the capacity of traditional data mining algorithms to process large data, we need to parallelize them and, using distributed technology, combine the resources of multiple machines to mine the data.

Apache Spark is a big data parallel computing framework based on in-memory computing. Spark focuses on large-scale data processing and caches intermediate results in memory to reduce disk I/O. Because of this, its performance is an order of magnitude higher than that of the disk-based MapReduce framework. The computing efficiency and performance of a parallel algorithm implemented on Spark can therefore be expected to improve on those of the same algorithm implemented on Hadoop.

FP-Growth is a widely used and highly efficient algorithm for mining frequent patterns, but it encounters memory bottlenecks when dealing with huge amounts of data. Taking advantage of Spark, the classic FP-Growth algorithm can be parallelized on Spark. The grouping strategy of the existing parallel FP-Growth algorithm can also be improved, which improves the load balance of the parallel algorithm. The Spark-based parallel FP-Growth algorithm and a second Spark-based parallel algorithm that uses the load-balancing strategy are then compared with the Hadoop-based parallel FP-Growth algorithm. Experimental results show that the Spark-based parallel FP-Growth algorithm has a great advantage when dealing with large data, and that its capacity and performance in handling huge amounts of data are also improved.
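The grouping step that the abstract refers to can be illustrated with a small single-machine sketch. The abstract does not describe the thesis's actual grouping or load-balancing strategy, so everything below is an assumption for illustration: the function names are hypothetical, the load estimate (support multiplied by rank in the frequency-ordered list) is a stand-in heuristic, and the greedy assignment is a generic "lightest group first" scheme. The sharding follows the commonly described parallel FP-Growth idea of sending each group only the transaction prefix it needs.

```python
from collections import Counter, defaultdict
import heapq


def frequent_items(transactions, min_support):
    """Return (item, support) pairs with support >= min_support,
    ordered by descending support (ties broken by item name)."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_support]
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


def balanced_groups(frequent, num_groups):
    """Greedily assign each item to the currently lightest group.

    ASSUMPTION: load is estimated as support * (rank + 1), a rough proxy
    for the size of the item's conditional pattern base. This is a
    hypothetical heuristic, not the strategy used in the thesis.
    """
    loads = sorted(
        ((support * (rank + 1), item)
         for rank, (item, support) in enumerate(frequent)),
        key=lambda pair: (-pair[0], pair[1]),
    )
    heap = [(0, g) for g in range(num_groups)]  # (total load, group id)
    heapq.heapify(heap)
    group_of = {}
    for load, item in loads:
        total, g = heapq.heappop(heap)
        group_of[item] = g
        heapq.heappush(heap, (total + load, g))
    return group_of


def shard_transactions(transactions, frequent, group_of):
    """Build one group-dependent transaction list per group.

    Each transaction is filtered to frequent items and sorted by
    descending support. Scanning from the least frequent item backwards,
    the first item seen for a group triggers emitting the prefix that
    ends at that item to that group.
    """
    rank = {item: i for i, (item, _) in enumerate(frequent)}
    shards = defaultdict(list)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        seen = set()
        for pos in range(len(ordered) - 1, -1, -1):
            g = group_of[ordered[pos]]
            if g not in seen:
                seen.add(g)
                shards[g].append(ordered[: pos + 1])
    return dict(shards)


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a"]]
    freq = frequent_items(data, min_support=2)
    groups = balanced_groups(freq, num_groups=2)
    for gid, shard in sorted(shard_transactions(data, freq, groups).items()):
        print(gid, shard)
```

On a cluster, the counting step and the sharding step would typically be expressed as distributed map and group-by-key operations, with each group's shard mined by a local FP-Growth run. The sketch keeps everything in plain Python so the grouping logic can be read in isolation.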

Keywords/Search Tags:

Spark, Data mining, FP-Growth algorithm, Parallelization, Load balance


Related items

1	Research And Application Of Parallel FP-Growth Mining Algorithm Based On Cloud Computing Platform
2	Research And Application Of Parallel FP-Growth Algorithm Based On Spark
3	Research And Implementation Of Classification Algorithm Parallelization Based On Spark
4	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
5	Research On Frequent Itemset Mining Algorithm And Its Parallelization Based On Spark
6	Study On FP-growth Algorithm In Pervasive Computing Environment
7	Zone Division And Dynamic Load Scheduling Algorithm Based On Heterogeneous Spark Cluster
8	Research Of FP-growth Data Mining Algorithm
9	Research And Implementation On Efficient Parallel Frequent Itemsets Mining Algorithm Based On Spark
10	CHAID Algorithm Parallelization And Application In Credit Risk Analysis