The Parallelization And Optimization Of Fp-Growth Algorithm Based On Spark

Posted on: 2016-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Fu
Full Text: PDF
GTID: 2348330479954725
Subject: Computer technology
Abstract/Summary:
With the development of computer technology, and network technology in particular, the volume of data that computers must process has grown rapidly. Traditional data mining algorithms, however, compute serially on a single node. When handling very large datasets, they are constrained by the hardware resources of that one machine and become inefficient. To improve their capacity for large data, these algorithms need to be parallelized and combined with distributed technology, so that the resources of multiple machines can be used together to mine the data.

Apache Spark is a parallel computing framework for big data that is based on in-memory computing. Spark focuses on large-scale data processing and caches intermediate results in memory to reduce disk I/O. As a result, its performance is an order of magnitude higher than that of the disk-based MapReduce framework. Parallel algorithms implemented on Spark can therefore be expected to achieve better computing efficiency and performance than those implemented on Hadoop.

Fp-Growth is a widely used and highly efficient algorithm for mining frequent patterns, but it runs into memory bottlenecks when dealing with huge amounts of data. Taking advantage of Spark, the classic Fp-Growth algorithm can be parallelized on Spark. The grouping strategy of the existing parallel Fp-Growth can also be improved so that the parallel algorithm achieves better load balance. The Spark-based parallel Fp-Growth algorithm and a second Spark-based variant that uses the load-balancing strategy are then compared with the parallel Fp-Growth based on Hadoop. Experimental results show that the parallel Fp-Growth based on Spark has a great advantage when dealing with large data, and that its capacity and performance in handling huge amounts of data are improved.
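The abstract does not describe the improved grouping strategy in detail. The following is only a minimal sketch of one way a load-balanced grouping of frequent items could look, not the thesis's method. It assumes that an item's support count is a usable proxy for the work its group will cause, and it swaps in a generic greedy heuristic (heaviest item first, always into the currently lightest group). The function names `frequent_items`, `balanced_groups`, and `group_loads` are hypothetical and chosen for illustration.

```python
import heapq
from collections import Counter


def frequent_items(transactions, min_count):
    """Count in how many transactions each item occurs; keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_count}


def balanced_groups(item_loads, num_groups):
    """Assign items to groups greedily.

    Assumption: item_loads maps each item to an estimated load (here, its
    support count). Items are taken heaviest first and each one goes to the
    group whose accumulated load is currently smallest.
    """
    heap = [(0, g) for g in range(num_groups)]  # (accumulated load, group id)
    heapq.heapify(heap)
    assignment = {}
    for item, load in sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, group = heapq.heappop(heap)
        assignment[item] = group
        heapq.heappush(heap, (total + load, group))
    return assignment


def group_loads(item_loads, assignment, num_groups):
    """Sum the estimated load per group, to inspect how balanced the result is."""
    totals = [0] * num_groups
    for item, group in assignment.items():
        totals[group] += item_loads[item]
    return totals


if __name__ == "__main__":
    transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"], ["b", "d"]]
    loads = frequent_items(transactions, min_count=2)
    assignment = balanced_groups(loads, num_groups=2)
    print(assignment)
    print(group_loads(loads, assignment, num_groups=2))
```

In a parallel Fp-Growth setting, a mapping like `assignment` would decide which worker builds the conditional tree for each item. A real implementation would need a better load estimate than raw support counts.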
Keywords/Search Tags: Spark, Data mining, Fp-Growth algorithm, Parallelization, Load balance