
Research On Mining Association Rules Based On High-dimensional Data And Incremental In Big Data Environment

Posted on: 2022-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: X C Zhao
GTID: 2518306524497174
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the development of technologies such as the Internet and sensors, society is becoming increasingly information-driven, and the rate at which data is produced worldwide is growing rapidly. Mobile communication data is the main artery that carries communication services and Internet information, and it is an important piece of infrastructure for the network information age. Communication in the big data environment has therefore attracted growing attention, and extracting valuable information from data has gradually become a research hotspot. Association rule mining, which analyzes and discovers associations between items in a data set, is an important branch of data mining, and the potential value hidden in the rules it uncovers is not known in advance. How to detect communication association rules efficiently and accurately in the big data environment has therefore become a hot research topic.

Although improvements to traditional association rule mining algorithms in the era of big data have achieved some results, the diverse data types and fast update rates of the big data environment leave considerable room for further research. In addition, existing improved algorithms still have high execution complexity and are difficult to adapt to parallel computing on large-scale data. In view of this, this thesis starts from the data types to be mined and the characteristics of communication data in the big data environment, and improves both the accuracy of the mining results and the efficiency of algorithm execution by improving data preprocessing and the algorithm steps. The main work of this thesis is as follows.

First, mining large-scale, high-dimensional data with the FP-growth algorithm suffers from inaccurate capture of data features, unbalanced node load, frequent data interaction, and low compactness of the frequent itemsets. To address these problems, a parallel mining algorithm based on MapReduce is proposed, named PARDG-MR (the Parallel Association Rules Mining Algorithm by using Dimension Granulating based on MapReduce). According to the characteristics of the data, the algorithm first proposes a dimension granulation and grouping strategy based on DGA (the Dimension Granulated Algorithm) and the load balancing algorithm GPL (the algorithm of Grouping method based on Prefix Length). These methods build on a load estimation method named DGPL; together they capture the feature attributes of high-dimensional complex data accurately and solve the problem of node load imbalance in data partitioning. Second, a parallel frequent itemset mining strategy named PARM (Parallel Association Rules Mining Algorithm), based on the PJPFP-Tree, is proposed. The PARM strategy parallelizes the grouping of frequent itemsets and improves the overall speed of the algorithm. Finally, building on the candidate pruning strategy, a pruning prefix lemma (PPL) is proposed, and the resulting algorithm is named PJPFP (Pruning JFP-growth Algorithm). This lemma improves the efficiency of pruning during frequent itemset mining, increases the compactness of the frequent itemsets, and further improves the overall mining efficiency of the algorithm. Theoretical analysis and experimental results show that the PARDG-MR algorithm not only overcomes the bottleneck of high-dimensional data mining effectively, but also greatly improves memory usage and mining efficiency.

Second, the Apriori association rule mining algorithm based on the MapReduce framework suffers from time-consuming candidate set generation and low execution efficiency, and the rapid update of data in the big data environment calls for incremental processing. To address these problems, this thesis proposes a method based on weighted dynamic updating of Apriori itemsets, named WDU-Apriori. First, the algorithm proposes the W-DPC (Weighted Dynamic Passes Combiner) mechanism for combining candidate sets, which improves the adaptability of the algorithm in the big data environment. Second, for new incremental data, a WBI (Weighted Border) strategy is designed to generate weighted boundary itemsets and improve the mining efficiency of incremental data. Finally, a CTP (Calculate Transform Probability) method is constructed to quantify the possibility that threshold boundary itemsets become frequent itemsets. As a result, previous mining results are reused more efficiently, which reduces node load and avoids frequent scans of the original data set. Theoretical analysis and experimental results show that the WDU-Apriori algorithm not only improves mining efficiency, but also reduces the time complexity of the algorithm.
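
Two general ideas recur in the abstract: counting itemset support separately on each data partition and merging the partial counts (the MapReduce pattern), and reusing earlier counts when new data arrives instead of rescanning the original data set. The sketch below illustrates only these generic ideas in Python. It is not PARDG-MR, WDU-Apriori, or any of their components, and it enumerates every itemset up to a fixed size by brute force rather than applying Apriori or FP-growth pruning. The function names, the `max_size` parameter, the sample transactions, and the 0.6 support threshold are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, max_size=2):
    """Map step (illustrative): count every itemset of up to max_size items in one partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, so ("a", "b") is one key
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def merge_counts(partials):
    """Reduce step (illustrative): sum the per-partition support counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


def frequent(counts, n_transactions, min_support):
    """Keep itemsets whose support count reaches min_support * n_transactions."""
    threshold = min_support * n_transactions
    return {itemset: c for itemset, c in counts.items() if c >= threshold}


# Hypothetical original data set, already split into two partitions.
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "c"]],
]
stored = merge_counts(count_itemsets(p) for p in partitions)
n = sum(len(p) for p in partitions)
before = frequent(stored, n, min_support=0.6)

# Incremental update: scan only the new transactions and merge them
# into the stored counts; the original partitions are not read again.
increment = [["a", "b"], ["a", "b", "c"]]
stored = merge_counts([stored, count_itemsets(increment)])
n += len(increment)
after = frequent(stored, n, min_support=0.6)

print(sorted(before))
print(sorted(after))
```

Because this sketch stores a count for every enumerated itemset, the incremental update is exact, but the storage cost grows quickly with the number of items. Methods that track only frequent and boundary itemsets, as the abstract describes, trade that storage for extra logic to decide which itemsets may become frequent.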
Keywords/Search Tags: frequent itemset, candidate set, high-dimensional data, incremental data, parallelization