Research On Parallel Association Rule Mining Algorithm Based On Hadoop Platform

Posted on:2018-08-16

Degree:Master

Type:Thesis

Country:China

Candidate:L Zhang

GTID:2348330533462722

Subject:Computer software and theory

Abstract/Summary:

The explosive growth of data volume challenges traditional computing technology and serial algorithms, but it also creates new opportunities, and big data has emerged in step with this trend. At this scale, serial association rule algorithms must be rewritten, and parallelizing them is urgent; combining parallel computing with a big data platform is a good solution. Association rule mining, which discovers relationships among items of information, is an important data mining task. When the traditional association rule algorithms Apriori and FP-Growth process big data on a single machine, memory overflow occurs. Studying association rules on Hadoop reduces the difficulty of programming and of data partitioning. Studying parallel association rule algorithms on Hadoop is therefore a natural direction. To address this problem, this thesis carries out the following research:

(1) The H-Apriori algorithm is studied and improved. In a big data environment, the serial Apriori algorithm cannot handle massive data. The intermediate stage of H-Apriori generates a large number of key/value pairs whose value is 1 and reads all transactions, which produces many candidate itemsets and consumes computation time. In this thesis, the database is reconstructed, the reading process is optimized, and an improved Hadoop-based algorithm is proposed that deletes non-frequent items to reduce redundant data. The algorithm effectively shrinks the transaction database, and counting with a hash tree reduces counting time and improves efficiency.

(2) A load-balancing, data-segmentation improvement of the FP-Growth algorithm on the Hadoop platform is proposed. In a big data environment, the serial FP-Growth algorithm cannot handle massive data, and PFP (Parallel FP-Growth) cannot process data once it reaches a certain volume. The improved algorithm uses load estimation and an improved balanced grouping method to overcome PFP's inability to process such data and its unbalanced load. It effectively balances the load on each node in the cluster and shortens the running time of the whole cluster.

A comparative experiment was carried out after building a Hadoop big data platform. The effectiveness of the algorithms was verified on authoritative data sets. The experiments show that the improved algorithms adapt better to big data and are more efficient.
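
The abstract does not give the thesis's actual procedures, so the following is only a minimal single-machine sketch of the two ideas it names: deleting non-frequent items to shrink the transaction database, and grouping items so that estimated load is spread evenly. The function names `prune_infrequent` and `balanced_groups`, the per-item load values, and the greedy heaviest-first assignment are all assumptions made for illustration; they are not the author's method and do not use Hadoop.

```python
from collections import Counter
import heapq


def prune_infrequent(transactions, min_support):
    """Remove items whose support count is below min_support.

    Returns the set of frequent items and the reduced transaction list.
    Transactions left empty after pruning are dropped.
    """
    # Count each item once per transaction (support count).
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}
    reduced = []
    for t in transactions:
        kept = sorted(set(t) & frequent)
        if kept:
            reduced.append(kept)
    return frequent, reduced


def balanced_groups(item_loads, num_groups):
    """Assign items to groups with a greedy heaviest-first heuristic.

    item_loads maps item -> estimated load (hypothetical values supplied
    by the caller). Each item goes to the group with the smallest total
    load so far. Returns the groups and their total loads.
    """
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    # Heaviest items first; ties are broken by item name for determinism.
    ordered = sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for item, load in ordered:
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    totals = [sum(item_loads[i] for i in grp) for grp in groups]
    return groups, totals


if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["e"]]
    frequent, reduced = prune_infrequent(db, min_support=2)
    print(sorted(frequent), reduced)

    groups, totals = balanced_groups({"a": 5, "b": 4, "c": 3, "d": 2}, 2)
    print(groups, totals)
```

In a MapReduce setting, the counting step would typically be one job and the pruning a filter applied while re-reading the transactions; the grouping step stands in for how items might be distributed across reducers once a load estimate per item is available.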

Keywords/Search Tags:

Big data, association rules, Hadoop, reconstruction, load balancing

Related items

1	Research On Algorithm And Application Of Big Data Association Rules Mining Based On Hadoop
2	Research On Parallel Association Rules Algorithm Based On HADOOP Platform
3	The Research Of Quantitative Association Rules Data Mining Based On Hadoop
4	Research And Application On Association Rules Mining Algorithm Base On Hadoop
5	Research On Association Rules Algorithm Based On Hadoop
6	Research And Implementation Of Mining Algorithm For Association Rules In Big Data Based On Hadoop
7	Mining Association Rules Algorithm Analysis Based On Hadoop
8	Research On Energy-aware Load Balancing In Heterogeneous Hadoop Cluster
9	Research On Load Balancing Algorithm For Scheduling Based On Hadoop
10	A Survey Of Mining Association Rules Algorithm In Big Data