Research On Frequent Itemsets Mining Algorithm Based On Mapreduce

Posted on: 2020-08-04 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Q Guo | Full Text: PDF
GTID: 2428330590471598 | Subject: Electronic and communication engineering
Abstract/Summary:
Big data is both a topic of wide interest and an important resource. Data is everywhere in daily life, and extracting useful information from it matters. Data mining brings the value of data into everyday production, which makes it an important technology for processing and analyzing big data. Data mining has several branches, including classification, association analysis, cluster analysis, and anomaly detection; of these, association analysis is an especially active research topic. The main research contents of this thesis are as follows.

First, a study of classical data mining algorithms shows that they generally suffer from low efficiency and high memory consumption. To address this, an improved Apriori algorithm combined with a genetic algorithm (GNA) is proposed, which finds frequent itemsets using a new genetic algorithm. The strength of the Apriori algorithm is that it is simple and easy to implement, but its joining and generation of candidate sets is overly complicated, and it scans the database once for each candidate set. These defects are the main reason for Apriori's low efficiency and high memory consumption. GNA uses the genetic algorithm to optimize the search space, applies Apriori's pruning strategy, and uses constrained crossover and mutation operators to simplify the joining and generation of Apriori's candidate sets.

Second, traditional data mining algorithms run in standalone mode, and their mining efficiency is not adequate for big data. The improved Apriori algorithm is therefore combined with Hadoop, and a MapReduce-based parallel algorithm for mining association patterns in big data (Mr_GNA) is proposed. Mr_GNA combines the GNA algorithm with Hadoop's MapReduce parallel computing framework to parallelize the algorithm. To ensure that Mr_GNA mines efficiently on a Hadoop cluster, a load balancing strategy is adopted. The frequent patterns are evaluated using the Kulczynski coefficient and the support imbalance ratio (IR).

The experimental results show that GNA has advantages over the Apriori and NSFI algorithms in time complexity, memory consumption, and mining efficiency. The improved big data mining algorithm is more efficient in cluster mode and outperforms parallel big data mining algorithms such as MRApriori and PFP-Growth. This shows that the Mr_GNA algorithm can effectively mine frequent patterns and meet the needs of big data mining.
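As a rough illustration of two ideas the abstract relies on, the sketch below counts itemset supports in a map/reduce style and then evaluates one pattern with the Kulczynski measure and the imbalance ratio (IR), using their standard textbook definitions. It is not the thesis's GNA or Mr_GNA algorithm: the counting is brute force, with no genetic search or Apriori pruning, it runs in plain Python and not on Hadoop, and the function names and toy transactions are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def map_partition(transactions, k):
    """Map step (hypothetical helper): count every k-item subset in one partition."""
    counts = Counter()
    for transaction in transactions:
        # Sort so that the same itemset always yields the same tuple key.
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts


def reduce_counts(partials):
    """Reduce step (hypothetical helper): sum partial counts from all partitions."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def kulczynski(sup_a, sup_b, sup_ab):
    """Kulczynski measure: the mean of P(A|B) and P(B|A)."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """IR: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B together))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


# Toy data (assumed), split into two partitions as a cluster might split it.
partitions = [
    [["milk", "bread"], ["milk", "bread", "eggs"], ["milk"]],
    [["bread", "eggs"], ["milk", "bread"], ["milk", "eggs"]],
]

singles = reduce_counts(map_partition(p, 1) for p in partitions)
pairs = reduce_counts(map_partition(p, 2) for p in partitions)

sup_milk = singles[("milk",)]
sup_bread = singles[("bread",)]
sup_both = pairs[("bread", "milk")]

kulc = kulczynski(sup_milk, sup_bread, sup_both)
ir = imbalance_ratio(sup_milk, sup_bread, sup_both)
print(sup_milk, sup_bread, sup_both, kulc, ir)
```

A Kulczynski value near 0.5 is usually read as neutral, and an IR near 0 means the two itemsets have balanced supports, which is why the two measures are commonly reported together.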
Keywords/Search Tags: data mining, Apriori, association analysis, parallelization