Research On Frequent And Closed High Utility Itemset Mining Algorithm Based On Spark

Posted on: 2021-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Wei
GTID: 2518306524970159
Subject: Computer Science and Technology
Abstract/Summary:
Extracting valuable information from datasets is an important task in data mining, and both frequent itemset mining and high-utility itemset mining have become active research topics. In either task, however, a single measure can hardly capture the full value of an itemset, and the number of itemsets returned is so large that processing the results is very time-consuming for the user. In response to these problems, the main work and contributions of this thesis are as follows:

1. To mine more valuable itemsets, this thesis combines the support measure with the utility measure and introduces the concept of closed itemsets to reduce the number of results, and on this basis formulates the problem of frequent closed high-utility itemset mining. As a compact representation, frequent closed high-utility itemsets are not only few in number but also preserve the information losslessly.

2. To solve the frequent closed high-utility itemset mining problem, this thesis proposes the Frequent Closed High-Utility Itemset Miner (FCHUIM) algorithm, together with several efficient data structures and pruning strategies that improve its performance. These include: the Total List Structure, which stores itemset information and gives the algorithm fast access to it; the proposed Extension Utility pruning upper bound, which is tighter than the upper bounds used by previous algorithms and can filter out more low-utility itemsets; the Pre-check method, an itemset subsumption detection strategy designed around the structure of the algorithm and the order in which itemsets are generated; and the proposed Nested List Structure, which reduces the number of candidate frequent closed high-utility itemsets by storing them in different data blocks according to their support values, so that the algorithm can further eliminate infrequent itemsets while efficiently mining the frequent closed high-utility itemsets in each data block. Finally, simulation experiments on real and synthetic datasets verify the effectiveness of the FCHUIM algorithm. Compared with the recent closed high-utility itemset mining algorithms CLS-Miner and CHUI-Miner, FCHUIM achieves higher performance.

3. As big data technology matures, many data mining algorithms use distributed platforms to improve their performance and efficiency. To meet the need for fast mining of large datasets, this thesis parallelizes the FCHUIM algorithm on the Spark platform and proposes the Parallel Frequent Closed High-Utility Itemset Miner (PFCHUIM) algorithm. Simulation experiments show that the algorithm can handle frequent closed high-utility itemset mining in a big data environment and that its performance is greatly improved, which indicates that the method is effective and feasible.
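To make the problem statement in point 1 concrete, the following is a minimal brute-force sketch in Python. It is not the FCHUIM or PFCHUIM algorithm and uses none of the list structures or pruning bounds described above; it only enumerates all itemsets of a tiny database and keeps those that are frequent, closed, and high-utility. It assumes the definitions commonly used in the literature: support is the number of transactions containing the itemset, utility is the sum of quantity times unit profit over those transactions, and an itemset is closed if no proper superset has the same support. The toy transactions, the profit table, and the names `support`, `utility`, and `mine_fchui` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "b": 1},
    {"a": 1, "b": 1, "c": 3},
    {"c": 2, "d": 1},
]
# Hypothetical unit profit (external utility) of each item.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 10}


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset.issubset(t))


def utility(itemset, transactions, profit):
    """Sum of quantity * unit profit over all transactions containing the itemset."""
    return sum(
        sum(t[i] * profit[i] for i in itemset)
        for t in transactions
        if itemset.issubset(t)
    )


def mine_fchui(transactions, profit, min_sup, min_util):
    """Brute-force enumeration (exponential in the number of items).

    Returns {itemset: (support, utility)} for itemsets that are
    frequent, closed, and high-utility under the assumed definitions.
    """
    items = sorted({i for t in transactions for i in t})
    candidates = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
    ]
    sup = {x: support(x, transactions) for x in candidates}
    result = {}
    for x in candidates:
        if sup[x] < min_sup:
            continue  # infrequent
        # Closed: no proper superset has the same support.
        if any(x < y and sup[y] == sup[x] for y in candidates):
            continue
        u = utility(x, transactions, profit)
        if u >= min_util:
            result[x] = (sup[x], u)
    return result


if __name__ == "__main__":
    for itemset, (s, u) in mine_fchui(TRANSACTIONS, PROFIT, 2, 10).items():
        print(sorted(itemset), "support =", s, "utility =", u)
```

On this toy data with a minimum support of 2 and a minimum utility of 10, the sketch keeps {a, b} (support 3, utility 28) and {a, b, c} (support 2, utility 20). The itemset {c} is frequent and closed but is dropped because its utility is only 6, and {a} is dropped because its superset {a, b} has the same support. Exhaustive enumeration like this is only feasible for a handful of items, which is why practical algorithms rely on dedicated data structures and pruning.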
Keywords/Search Tags: Compact representation, Data mining, Frequent closed high-utility itemset, High-utility itemset, Spark platform, Parallel algorithm