Research On Distributed Association Rule Algorithm

Posted on: 2018-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: T B Gao
GTID: 2428330566967363
Subject: Communication and Information System
Abstract/Summary:
Association rule mining is an important branch of data mining. With the rapid development of computer technology and the Internet, the data generated daily in finance, telecommunications, and insurance is growing explosively, which gives distributed association rule algorithms broad room for development. Existing parallel Apriori algorithms suffer from repeated database scans, high memory consumption, and heavy inter-node communication, and these problems cannot be optimized simultaneously. This dissertation proposes a parallel Apriori algorithm that converts the original database into a Boolean matrix and a weight matrix, reducing memory consumption. The matrix is partitioned horizontally into n smaller matrices, and a maximum itemset length is introduced to limit the generation of candidate itemsets that have little practical significance. Support and average weight are computed by matrix operations, which shortens the running time of the algorithm; the minimum support and minimum average weight are used to reduce the generation of candidate itemsets.

The main work of this dissertation is as follows:

(1) Research on the Hadoop distributed system. The core technologies and operating mechanisms of Hadoop are introduced, including the Hadoop Distributed File System (HDFS), the HBase database, and the MapReduce computing framework. The basic concepts of data mining and the requirements and basic framework of a Hadoop-based data mining system are presented, and a system model is given.

(2) Improvement of the parallel Apriori algorithm. To address the repeated database scans, high memory consumption, heavy inter-node communication, and high I/O load of existing parallel Apriori algorithms, this dissertation proposes a parallel Apriori algorithm based on weighted itemsets. The algorithm uses the minimum average weight and the minimum support to limit the generation of infrequent itemsets, computes the support and average weight of itemsets with matrices, and sets a maximum length for practically significant frequent itemsets. All frequent itemsets are generated with a single scan of the database.

(3) Experimental verification of the improved algorithm. A Hadoop distributed cluster is built, and the AprioriMR algorithm is compared with the weighted-itemset parallel Apriori algorithm in terms of data size, number of nodes, and support threshold. The comparison shows that, for a fixed minimum support, the improved algorithm becomes more efficient as nodes are added. When the minimum support rises beyond a certain point, however, fewer candidates exceed it and the efficiency gain of the improved algorithm slows. When the number of nodes grows beyond a certain level, the time needed to merge the nodes' results also increases and the efficiency of the improved algorithm declines.
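The matrix-based counting described above can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the function names, the sample data, and the per-item weight table (standing in for the weight matrix) are all hypothetical, and "average weight" is assumed here to mean the mean of the item weights in an itemset, which the abstract does not define. Because an average is not anti-monotone, this sketch prunes candidates by support only and applies the average-weight threshold when reporting itemsets; the thesis may combine the two thresholds differently. The horizontal blocks are counted in a loop that stands in for the map step, and the per-block counts are summed as a reduce step would.

```python
from itertools import combinations

def to_boolean_matrix(transactions, items):
    """One row per transaction, one column per item; 1 if the item is present."""
    return [[1 if item in t else 0 for item in items] for t in transactions]

def split_rows(matrix, n):
    """Partition the matrix horizontally into at most n blocks of similar size."""
    size = max(1, -(-len(matrix) // n))  # ceiling division, at least one row
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]

def local_count(block, cols):
    """Map-like step: rows of one block in which every column in cols is 1."""
    return sum(all(row[c] for c in cols) for row in block)

def mine(transactions, weights, min_support, min_avg_weight, max_len, n_blocks=2):
    """Return {itemset: support} for itemsets that meet both thresholds."""
    items = sorted(weights)
    # The transactions are read once; all later counting uses the matrix.
    blocks = split_rows(to_boolean_matrix(transactions, items), n_blocks)
    total = len(transactions)
    result = {}
    current = [(c,) for c in range(len(items))]
    for length in range(1, max_len + 1):
        survivors = []
        for cols in current:
            # Reduce-like step: sum the per-block counts.
            support = sum(local_count(b, cols) for b in blocks) / total
            if support < min_support:
                continue
            survivors.append(cols)
            # Assumed definition: mean weight of the items in the itemset.
            avg_weight = sum(weights[items[c]] for c in cols) / len(cols)
            if avg_weight >= min_avg_weight:
                result[tuple(items[c] for c in cols)] = support
        if not survivors:
            break
        # Apriori join: keep candidates whose sub-itemsets all met min_support.
        known = set(survivors)
        pool = sorted({c for cols in survivors for c in cols})
        current = [cand for cand in combinations(pool, length + 1)
                   if all(sub in known for sub in combinations(cand, length))]
    return result

# Hypothetical sample data, for illustration only.
TRANSACTIONS = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
WEIGHTS = {"bread": 0.6, "milk": 0.8, "butter": 0.3}

if __name__ == "__main__":
    print(mine(TRANSACTIONS, WEIGHTS, min_support=0.5,
               min_avg_weight=0.5, max_len=2))
```

The `max_len` parameter plays the role of the maximum itemset length mentioned in the abstract: the loop stops generating candidates once that length is reached, regardless of how many longer itemsets would still satisfy the support threshold.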
Keywords: Association rule, Hadoop, weighted itemset, matrix