Font Size: a A A

Research Of FP-Growth Algorithm Based On Spark

Posted on:2020-06-22Degree:MasterType:Thesis
Country:ChinaCandidate:H XieFull Text:PDF
GTID:2428330578454186Subject:Computer technology
Abstract/Summary:PDF Full Text Request
With the rapid growth of data volume in the information age,people are increasingly keen to find valuable information from massive data.Data mining technology is becoming more and more mature,and data mining theory and algorithms are improving day by day.With the explosive growth of data volume,the requirements of computer operation for computer memory are getting higher and higher.The FPGrowth algorithm itself has the disadvantages of complex logic and multiple iterations.It is difficult to complete the task of mining massive data.This requires the development of entirely new algorithms or improvements to traditional algorithms.Based on the Spark parallel computing framework,this paper proposes an improved strategy of FP-Growth algorithm from the aspects of storage and grouping,which effectively improves the performance of the algorithm.The main work is as follows:First,the improvement of storage strategies.Spark is a memory-based parallel computing framework that stores intermediate results in RDD.In the face of massive data,when RDD is unable to satisfy all intermediate results,it releases the RDD that is not needed temporarily and recalculates it when needed.Based on the characteristics of Spark itself,this paper proposes a method for caching the intermediate calculation results.For the problem that the conditional pattern base needs to be repeatedly calculated for the partitioned transaction set,the partitioned transaction set is cached;for the problem that the association rule needs to be repeatedly calculated for the frequent item set,each FP-Tree is performed.Frequent itemsets generated during mining are cached.By buffering the above intermediate results,the overhead caused by double counting is effectively avoided.Second,the improvement of the grouping method.In parallel computing,the load of each node determines the computing time of the node,and the overall running time depends on the node with the largest load.Based on the idea of balancing load on each computing node,an improved strategy of grouping is proposed.The strategy uses the position of the frequent 1-item concentration items in descending order as the weight indicator of the load quantity,and performs equalization grouping to balance the load of each node to a certain extent.This solves the problem of delaying the overall running time due to different load amounts of each node,reducing the overall running time and improving the efficiency of the algorithm.Finally,the paper implements the improved algorithm based on Spark.The experimental results show that the proposed improved strategy effectively improves the running speed of the FP-Growth algorithm.
Keywords/Search Tags:Association rule, FP-Growth, Spark, Parallel Computing
PDF Full Text Request
Related items