
Research On Association Rules Algorithm Based On Hadoop

Posted on: 2019-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Ni
Full Text: PDF
GTID: 2428330551956986
Subject: Information and Communication Engineering
Abstract/Summary:
With the explosive growth of data, efficiently extracting useful information from large datasets has become a research hotspot in the field of big data. Data mining plays an important role in uncovering the value behind data, and association rule mining, which discovers relationships among data items, is an important research direction within it. As a core distributed platform for cloud computing, Hadoop provides distributed storage and parallel computing components that strongly support the parallel design and implementation of mining algorithms. This thesis studies association rule mining algorithms based on Hadoop. The main contents are as follows.

First, an improved Apriori algorithm based on the FP-tree is proposed to speed up Apriori by reducing the amount of data scanned. The improved algorithm compresses the data with an FP-tree and improves Apriori through tail partitioning, dynamic data reduction, and fast support counting. Because the improved algorithm cannot handle big data effectively when run on a single machine, a parallel version is designed and implemented on Hadoop. Experimental results show that the proposed algorithm not only mines faster on a single machine but also achieves good speedup and data scalability in a cluster environment, making it suitable for mining large datasets.

Second, the parallelization of the FP-Growth algorithm is analyzed, and the PFP algorithm, a parallel form of FP-Growth, is analyzed and improved. PFP does not consider load imbalance among groups in its grouping stage, which limits its overall performance. To address this, a load-balancing PFP algorithm is proposed. The improved algorithm builds a new load prediction model for load estimation: the model first samples the data and then weights the total number of positions in the header table and the items in the sampled data. Experimental results show that the improved load-balancing PFP algorithm achieves higher overall mining performance, with good speedup and data scalability.
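To illustrate the map/reduce style of support counting that a parallel Apriori on Hadoop generally relies on, here is a minimal single-process Python sketch. It is textbook Apriori, not the thesis's improved algorithm: it does not include the FP-tree compression, tail partitioning, or dynamic data reduction described above. The function names (`map_phase`, `reduce_phase`, `next_candidates`, `apriori`) and the sample transactions are hypothetical, and each inner list in `splits` merely stands in for one input split handled by one mapper.

```python
from collections import Counter
from itertools import combinations


def map_phase(split, candidates):
    """Mapper: emit (candidate, 1) for each candidate contained in a transaction."""
    for transaction in split:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:
                yield candidate, 1


def reduce_phase(pairs, min_support):
    """Reducer: sum the counts per candidate and keep the frequent ones."""
    counts = Counter()
    for candidate, count in pairs:
        counts[candidate] += count
    return {c: n for c, n in counts.items() if n >= min_support}


def next_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (Apriori pruning)."""
    items = sorted(set().union(*frequent)) if frequent else []
    return {
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(s) in frequent for s in combinations(c, k - 1))
    }


def apriori(splits, min_support):
    """Run one map/reduce round per itemset size until no candidates remain."""
    candidates = {frozenset([i]) for split in splits for t in split for i in t}
    result, k = {}, 1
    while candidates:
        # Every split is scanned once per round; on a cluster the mappers
        # would process their splits in parallel.
        pairs = (p for split in splits for p in map_phase(split, candidates))
        frequent = reduce_phase(pairs, min_support)
        result.update(frequent)
        k += 1
        candidates = next_candidates(frequent, k)
    return result


# Hypothetical data: two splits of two transactions each.
splits = [
    [["bread", "milk"], ["bread", "beer", "eggs"]],
    [["milk", "beer", "bread"], ["bread", "milk", "beer"]],
]
result = apriori(splits, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Each round of the loop scans every split again, which is the data-scanning cost that the improved algorithm aims to reduce.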
Keywords/Search Tags:Hadoop, MapReduce, Data Mining, Association Rules, Apriori, FP-Growth, Parallel Algorithm