
Research On Parallel Association Rule Mining Algorithm Based On Hadoop Platform

Posted on: 2018-08-16 | Degree: Master | Type: Thesis
Country: China | Candidate: L Zhang | Full Text: PDF
GTID: 2348330533462722 | Subject: Computer software and theory
Abstract/Summary:
The explosive growth of data volume challenges traditional computing technology and serial algorithms, but it also brings new opportunities for development. Big data has emerged from this trend, and it requires serial association rule algorithms to be rewritten. Parallelizing these serial algorithms is therefore urgent, and applying parallel computing on a big data platform is a good solution. Association rule mining, which discovers relationships between items of information, is an important data mining task. When the traditional association rule algorithms Apriori and FP-Growth process big data on a single machine, memory overflow occurs. Hadoop reduces the difficulty of programming and of partitioning the data, so studying parallel association rule algorithms on Hadoop is a general trend. To address this problem, this thesis carries out the following research:

(1) The H-Apriori algorithm is studied and improved. In a big data environment, the serial Apriori algorithm cannot process massive data. The intermediate stage of H-Apriori generates a large number of key/value pairs whose value is 1 and reads all transactions, which produces a large number of candidate itemsets and consumes computation time. In this thesis, the database is reconstructed, the reading process is optimized, and an improved Hadoop-based algorithm is proposed that deletes infrequent items to reduce redundant data. The algorithm effectively shrinks the transaction database, and counting with a hash tree reduces the counting time and improves efficiency.

(2) An improved FP-Growth algorithm with load-balanced data partitioning on the Hadoop platform is proposed. In a big data environment, the serial FP-Growth algorithm cannot process massive data, and PFP (Parallel FP-Growth) fails once the data reaches a certain volume. The improved algorithm uses load estimation and an improved balanced grouping method to overcome both PFP's inability to process such data and its unbalanced load. It effectively balances the load across the nodes of the cluster and shortens the running time of the whole cluster.

Comparative experiments were carried out after building a Hadoop big data platform, and the effectiveness of the algorithms was verified on authoritative data sets. The experiments show that the improved algorithms adapt better to big data and are more efficient.
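
The two ideas summarized above, deleting infrequent items to shrink the transaction database and grouping items so that each node receives a similar load, can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: the function names `prune_infrequent` and `balanced_groups` are hypothetical, the per-item load is assumed to be supplied by the caller because the abstract does not state the load estimation formula, and a plain greedy "heaviest item to the least loaded group" rule stands in for the improved balanced grouping method.

```python
import heapq
from collections import Counter


def prune_infrequent(transactions, min_support):
    """Remove items whose support count is below min_support.

    Returns the reduced transactions (emptied ones are dropped) and the
    support count of every item seen in the input.
    """
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}
    reduced = [[item for item in t if item in frequent] for t in transactions]
    return [t for t in reduced if t], counts


def balanced_groups(loads, num_groups):
    """Greedily assign items to groups so that group loads stay close.

    `loads` maps item -> estimated load (an assumed input, see lead-in).
    Items are taken from heaviest to lightest, and each one goes to the
    group whose current total load is smallest.
    """
    heap = [(0, g) for g in range(num_groups)]  # (total load, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    # Ties on load are broken by item name so the result is deterministic.
    for item, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    totals = [sum(loads[item] for item in grp) for grp in groups]
    return groups, totals


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"], ["e"]]
    reduced, counts = prune_infrequent(data, min_support=2)
    print(reduced)
    # Hypothetical load estimates for five frequent items on two nodes.
    print(balanced_groups({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, 2))
```

In a Hadoop job the pruning step would typically run after a first counting pass, and the grouping would decide which reducer mines which items; here both are ordinary functions so the logic can be checked in isolation.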
Keywords/Search Tags: Big data, association rules, Hadoop, reconstruction, load balancing