
Research And Application Of Parallel FP-Growth Algorithm Based On Spark

Posted on: 2019-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Luo
GTID: 2348330542489087
Subject: Management Science and Engineering
Abstract/Summary:
Single-machine data mining algorithms become inefficient when they are used to analyze massive data. Reimplementing such an algorithm with distributed big data technology can improve its efficiency considerably. Spark is a distributed framework for big data analysis based on in-memory computing. Because Spark caches the intermediate results of an algorithm in node memory, it reduces I/O operations and can run much faster than the Hadoop framework. FP-Growth is a widely used algorithm in the field of association rule mining. Because FP-Growth iteratively builds FP-Tree structures in memory, it runs into memory bottlenecks when it is used to mine large amounts of data. This paper proposes a parallel version of FP-Growth based on Spark, named SpaFP, which runs faster. SpaFP has two weaknesses, however. It does not consider load balance when grouping items, so the total running time of a single node may be too long. Its header table is an array, so the time complexity of iteratively constructing FP-Trees is high. To improve the efficiency of SpaFP, this paper proposes an optimized algorithm, named EHSpaFP, which improves on SpaFP in two respects: (1) a balanced grouping strategy that assigns the item with the largest load to the group with the smallest current load; (2) a new FP-Growth header table structure that adds a HashMap so that the address of an element can be accessed quickly, which reduces the time complexity. The paper also applies EHSpaFP to text topic analysis: combining the LDA topic model with EHSpaFP makes it possible to uncover latent knowledge associations in text. The paper analyzes more than 10,000 articles on the Belt and Road and obtains descriptions of the subject knowledge they contain.
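The two EHSpaFP optimizations named in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not the thesis's implementation: the abstract does not say how an item's load is estimated, so the `item_loads` values below are hypothetical, and the names `balanced_groups` and `HeaderTable` are introduced here only for illustration. The first function shows the greedy idea of giving the heaviest remaining item to the currently lightest group; the class shows a header table backed by a hash map so that an item's entry is found without scanning an array.

```python
import heapq


def balanced_groups(item_loads, num_groups):
    """Greedy grouping: the heaviest remaining item goes to the lightest group.

    item_loads maps item -> estimated load (hypothetical values; the load
    estimate used in the thesis is not given in the abstract).
    """
    # Min-heap of (current_load, group_id): the lightest group is on top.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = {g: [] for g in range(num_groups)}
    # Visit items from largest load to smallest; ties broken by item name.
    for item, load in sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    return groups


class HeaderTable:
    """Header table with a dict index for average O(1) lookup by item."""

    def __init__(self):
        self.entries = []  # [item, count] pairs in insertion order
        self.index = {}    # item -> position in self.entries

    def add(self, item, count=1):
        pos = self.index.get(item)
        if pos is None:
            # First time this item is seen: append and remember its position.
            self.index[item] = len(self.entries)
            self.entries.append([item, count])
        else:
            # Known item: update in place without scanning the array.
            self.entries[pos][1] += count

    def count(self, item):
        pos = self.index.get(item)
        return 0 if pos is None else self.entries[pos][1]


if __name__ == "__main__":
    loads = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 4}  # hypothetical loads
    print(balanced_groups(loads, 2))
```

With an array-only header table, each lookup during tree construction scans the entries linearly; the dict index trades a small amount of extra memory for constant-time access on average, which is the effect the abstract attributes to adding a HashMap.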
Keywords/Search Tags: Spark, association rules, FP-Growth, Parallelization