Research On Frequent Pattern Mining Algorithm In Big Data Environment

Posted on: 2020-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: L Wu
Full Text: PDF
GTID: 2428330596995132
Subject: Software engineering
Abstract/Summary:
Association rule mining, an important branch of data mining, discovers hidden connections in data to support decision-making. It has been widely applied in fields such as web mining, recommender systems, and fault diagnosis. The most important and time-consuming step of association rule mining is the acquisition of frequent patterns. A great deal of research, both in China and abroad, has been carried out to speed up frequent pattern mining. However, in the age of big data, current frequent pattern mining algorithms cannot meet the growing demand for low time cost as data volumes keep increasing. Improving the efficiency of frequent pattern mining in a big data environment remains a major challenge in the field of data mining. To improve the efficiency of frequent pattern mining, the following research is carried out:

(1) Combining the classical Apriori, FP-growth, and Eclat algorithms, a frequent pattern mining algorithm based on Interval Interaction and Transaction Mapping (IITM) is proposed. First, the proposed algorithm needs to scan the dataset only twice: frequent 1-itemsets are generated in the first scan, and a conditional pattern tree is generated in the second scan. Then, the intervals of all frequent 1-itemsets are obtained by scanning the conditional pattern tree. Subsequently, frequent pattern growth is performed by interval intersection, which avoids the time cost of recursively generating conditional pattern trees. In addition, several measures are introduced to improve the efficiency of the algorithm, namely a hash storage structure to store the intervals of the itemsets, a Bloom filter to filter out non-frequent itemsets, and an optimized interval intersection.

(2) On the basis of the IITM algorithm, the Parallel Interval Interaction and Transaction Mapping (PIITM) algorithm is proposed, which runs on the big data processing platform Spark. The PIITM algorithm distributes the conditional pattern bases of different suffixes to different machines (nodes) so that the data on each node is independent. Frequent itemset mining can therefore be performed in parallel on each node. When dividing the data, the PIITM algorithm considers both the load capacity of the nodes and the original distribution of the data among the nodes, which balances the load of the nodes as far as possible. In addition, the PIITM algorithm assigns data, wherever possible, to the node that already holds the largest share of the corresponding conditional pattern base, which reduces unnecessary data exchange during the data division phase. To make the PIITM algorithm more efficient, more extensible, and fault-tolerant, the Spark big data processing engine is applied to the distributed data mining.

Finally, the performance of the two proposed methods and of other up-to-date algorithms is evaluated on multiple real datasets. The experiments show that the IITM and PIITM algorithms achieve satisfactory performance under different support thresholds on multiple real datasets.
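To make the interval intersection step in (1) more concrete, the following is a minimal Python sketch, not the thesis's implementation. It assumes that transaction mapping has already assigned each itemset a sorted list of disjoint closed intervals of mapped transaction IDs, and that the support of an itemset is the total number of IDs those intervals cover. The names `intersect_intervals`, `support`, and `intervals_by_itemset`, as well as the sample data, are hypothetical and chosen only for illustration. The dictionary stands in for the hash storage structure mentioned above; the Bloom filter and the tree construction are omitted.

```python
from typing import Dict, FrozenSet, List, Tuple

# Closed interval [start, end] of mapped transaction IDs (assumed representation).
Interval = Tuple[int, int]


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of disjoint closed intervals with two pointers."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            # The current intervals overlap on [lo, hi].
            result.append((lo, hi))
        # Advance whichever list's current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def support(intervals: List[Interval]) -> int:
    """Number of mapped transaction IDs covered by the intervals."""
    return sum(end - start + 1 for start, end in intervals)


# Hypothetical hash-based store: itemset -> intervals (illustrative data only).
intervals_by_itemset: Dict[FrozenSet[str], List[Interval]] = {
    frozenset({"a"}): [(1, 5), (8, 10)],
    frozenset({"b"}): [(3, 9)],
}

# Grow the 2-itemset {a, b} by intersecting the intervals of its two subsets.
ab = intersect_intervals(
    intervals_by_itemset[frozenset({"a"})],
    intervals_by_itemset[frozenset({"b"})],
)
intervals_by_itemset[frozenset({"a", "b"})] = ab

min_support = 4  # assumed threshold for this example
is_frequent = support(ab) >= min_support
```

In this sketch a candidate itemset is evaluated with a single linear pass over two interval lists, which illustrates why pattern growth by intersection can avoid rebuilding conditional pattern trees recursively.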
Keywords/Search Tags:Big data, Data mining, Frequent patterns