Font Size: a A A

Research On 2FP-Forest Algorithm Based On Frequent Itemsets Mining And Its Parallel Processing

Posted on:2020-07-17Degree:MasterType:Thesis
Country:ChinaCandidate:Z R WangFull Text:PDF
GTID:2428330599453771Subject:Engineering
Abstract/Summary:PDF Full Text Request
With the geometric growth of data,the era of big data is coming.The most important task in the era of big data is to collect data and organize data.Data mining is an important step and the most basic step.Among them,mining frequent itemsets in massive data is the most basic work in data mining,and it is highly valued and researched by researchers at home and abroad.At present,the improvement of the frequent item set mining algorithm is basically based on the improvement of the classic Apriori algorithm and the FP-Growth algorithm.The FP-Growth algorithm has the following disadvantages: scanning the first pass data set only counts the support of the 1-item set,while the scan of the data set consumes more time;FP-Tree simply merges with the same prefix,without considering whether the item set on the path may constitute a frequent item set,does not control the depth,width and number of nodes of the FP-Tree,and the memory consumption is excessive when the amount of data increases;the size of the conditional FP-Tree is large,which makes recursive mining The depth of the increase.Therefore,this paper p roposes to mine 2FP-Forest algorithm based on frequent itemsets and define 2FP-Tree and 2FP-Forest data structures.The 2FP-Forest algorithm counts the support of all 1-item sets and 2-item sets and prune when traversing the first-pass data set.In 2FP-Forest,this paper makes full use of the pruning effect of 2-item set,and performs sufficient pruning and merging nodes in the construction process,so that there is no non-potential candidate 3-item set in 2FP-Forest,and compared with FP-Tree,2FP-Forest has fewer nodes,and 2FP-Tree has less depth and width than FP-Tree,which reduces the depth and width of conditional FP-Tree.By comparing the mining results with the FP-Growth algorithm and COFI algorithm in the experimental data set T10I4D100 K and Mushroom,the comparison of the number of nodes,construction time and the whole mining efficiency of FP-Tree and 2FP-Forest was carried out.The experimental results verify the correctness and efficiency of the proposed 2FP-Forest algorithm.Finally,this paper uses the Python language to provide real-time distributed interface process and multi-process interface thread,parallelization of 2FP-Forest algorithm.Because the process interface can be distributed to multiple computers in real time for parallelization and the multiprocessing module supports multiple processes,the managers sub-module can distribute multiple processes to multiple machines,so a service process can act as a dispatcher to 2FP-Forest.Network communication,distributing each 2FP-Tree to other multiple processes.Since the managers module is well packaged,it is possible to write a real-time distributed multi-process program and implement the parallelized 2FP-Forest algorithm.The experimental results compared with CWBPFP algorithm and PFP algorithm verify the feasibility and efficiency of the parallelized 2FP-Forest algorithm.
Keywords/Search Tags:2FP-Forest algorithm, 2FP-Tree, frequent 2-item set, potential candidate 3-item set, parallelization
PDF Full Text Request
Related items