
Research On Parallel Mining Algorithm Of Association Pattern Based On Spark

Posted on: 2022-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Deng
Full Text: PDF
GTID: 2518306551971099
Subject: Master of Engineering
Abstract/Summary:
Association pattern mining, one of the classic research directions in data mining, aims to uncover hidden associations among transactions that appear unrelated on the surface. Analysis of the mining results can then guide future decision-making. In the Internet era, data is generated continuously. As the volume of data grows, so does the amount of useless information in it, and the shortage of usable knowledge persists. Extracting usable knowledge from the data with effective techniques therefore remains an active research topic. Although several methods have attempted breakthroughs in association pattern mining, mining today's massive data still faces great challenges: traditional stand-alone itemset mining algorithms struggle on large datasets, and in big data environments parallel itemset mining algorithms suffer from long running times, uneven load across nodes, and high memory usage. This thesis addresses these difficulties in parallel association pattern mining and carries out the following research.

(1) An analysis of the advantages of combining the Spark framework with the iterative FP-Growth algorithm. First, traditional serial itemset mining is compared with parallel itemset mining. Experiments on the webdocs, mushroom and accidents datasets show that the parallel approach can effectively handle association pattern mining on large-scale data and that its mining speed is significantly better than that of serial mining. Parallel itemset mining algorithms based on Hadoop and on Spark are then studied further. The experiments show that FP-Growth on Spark achieves higher mining efficiency and better stability than FP-Growth on MapReduce.

(2) An improved Spark-based parallel itemset mining algorithm, Opt-SFPG, is proposed to quickly mine associations in large-scale transaction data. Building on the advantages of the Spark framework, the algorithm optimizes both the size of the generated FP-tree and the amount of computation per node. Experiments on the webdocs dataset verify the efficiency of Opt-SFPG. To analyze its performance more comprehensively, the doubly optimized Opt-SFPG algorithm is compared on the T40I10D100K and webdocs datasets with the Eq-SFPG algorithm (storage optimization and dynamic balanced grouping), the Ht-SFPG algorithm (FP-tree header table optimization and pruning), the traditional SFPG algorithm, and the existing IPFP-Growth and BFPG algorithms. The comparison covers four aspects: data scale, support value, speed-up and number of nodes. The experimental results show that the proposed Opt-SFPG algorithm mines faster and parallelizes better.

(3) A method combining the Opt-SFPG algorithm with the Spark-LDA model is proposed to mine associations among topic words in large-scale text data. First, the Spark-LDA model is used to obtain a dimension-reduced text topic description matrix. The Opt-SFPG algorithm is then used to mine the resulting text topic dataset. Finally, implicit text topic information is obtained through in-depth analysis of the mined associations. The results show that this method is feasible and efficient on newspaper and periodical texts whose keywords and titles contain "The Belt & Road".
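
The abstract does not give the internals of Opt-SFPG, so the following is only a minimal plain-Python sketch of the general idea behind group-based parallel frequent itemset mining, not the thesis's algorithm. It assumes a simple round-robin assignment of frequent items to groups and replaces the per-group FP-tree with brute-force counting. All function names and parameters (`frequent_items`, `group_dependent_shards`, `mine_shard`, `num_groups`, `max_len`) are hypothetical. In a Spark setting each shard would typically be processed by a separate task; here the shards are processed in a loop.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_items(transactions, min_count):
    """Return items with support >= min_count, most frequent first."""
    counts = Counter(item for t in transactions for item in set(t))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, c in ranked if c >= min_count]


def group_dependent_shards(transactions, ranked_items, num_groups):
    """Split transactions into per-group shards (hypothetical round-robin grouping).

    Each transaction is reordered by item rank. For every group, the shard
    receives the prefix ending at the last item of that group, so any itemset
    whose lowest-ranked item belongs to the group can be counted locally.
    """
    rank = {item: i for i, item in enumerate(ranked_items)}
    group_of = {item: i % num_groups for item, i in rank.items()}
    shards = defaultdict(list)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        seen = set()
        for pos in range(len(ordered) - 1, -1, -1):
            g = group_of[ordered[pos]]
            if g not in seen:
                seen.add(g)
                shards[g].append(ordered[:pos + 1])
    return shards, group_of


def mine_shard(shard, group, group_of, min_count, max_len):
    """Brute-force stand-in for local FP-tree mining on one shard."""
    counts = Counter()
    for prefix in shard:
        for size in range(1, max_len + 1):
            for combo in combinations(prefix, size):
                # Count only itemsets owned by this group (last item decides).
                if group_of[combo[-1]] == group:
                    counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= min_count}


def mine(transactions, min_count, num_groups=2, max_len=3):
    ranked = frequent_items(transactions, min_count)
    shards, group_of = group_dependent_shards(transactions, ranked, num_groups)
    result = {}
    for g, shard in shards.items():  # each shard could run on its own worker
        result.update(mine_shard(shard, g, group_of, min_count, max_len))
    return result


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
    for itemset, support in sorted(mine(data, min_count=3).items()):
        print(itemset, support)
```

Because every itemset is counted in exactly one shard, the per-shard results can be merged without reconciling duplicate counts. How evenly the work is spread depends on the grouping strategy, which is the kind of load-balancing concern the abstract attributes to existing parallel algorithms.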
Keywords/Search Tags: Hadoop, Spark, FP-Growth algorithm, Association pattern, Text topic mining