Research on Top-k High Utility Itemset Mining Based on Spark

Posted on: 2020-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z H He
Full Text: PDF
GTID: 2428330590471602
Subject: Electronic and communication engineering
Abstract/Summary:
With the advent of the big data era, every industry is filled with massive amounts of data. The ability to obtain useful information from large volumes of data quickly and effectively has become a measure of whether an enterprise is competitive. Association rule mining makes it possible to efficiently mine valuable and interesting knowledge and rules from data. However, mining requirements change and thresholds are difficult to set during the mining process, so Top-k high utility itemset mining was proposed to make the technique more widely usable in real life. This thesis studies efficient algorithms for high utility itemset mining in both single-machine mode and a distributed cluster environment. The main content is as follows.
First, existing Top-k high utility itemset mining algorithms raise the threshold slowly and generate massive candidate sets during iteration, which makes memory usage too large. To address these problems, an improved TKO algorithm based on R-list is proposed. The algorithm uses a data structure called R-list to rapidly access the information stored in the list and to mine itemsets. It combines an improved RSD threshold-raising strategy to preprocess the data and uses a set enumeration tree to represent the search space. During the recursive search, stricter pruning parameters are used to calculate the utility of multiple itemsets simultaneously, which narrows the search space. Experimental results on different types of data sets show that the improved algorithm is superior to other Top-k high utility itemset mining algorithms in memory efficiency and remains stable as the value of k changes.
Second, to solve the low efficiency and memory overflow of traditional mining algorithms when mining large-scale data in a distributed cluster environment, the improved TKO algorithm is combined with Spark, and a Spark-based parallel high utility itemset mining algorithm, STKO, is proposed. STKO runs on the Spark platform and changes the original data storage structure. It uses broadcast to optimize the iterative process, which avoids a large amount of recomputation, and uses load balancing to achieve parallel mining of Top-k high utility itemsets. The experimental results show that the STKO algorithm can effectively mine high utility itemsets in big data and can meet the needs of high utility itemset mining on large data sets in a distributed cluster environment.
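To make the core idea in the abstract concrete, the following is a minimal single-machine sketch of Top-k high utility itemset mining with a rising threshold. It is not the thesis's R-list, RSD, TKO, or STKO implementation, whose details the abstract does not give. It assumes the standard definitions (the utility of an itemset in a transaction is the sum of quantity times unit profit over its items) and uses the well-known transaction-weighted utility (TWU) upper bound for pruning in a set enumeration tree. All names (`top_k_high_utility`, `PROFIT`, `TRANSACTIONS`) and the data are hypothetical.

```python
import heapq

# Hypothetical unit profits and transactions (item -> purchased quantity).
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "c": 1, "d": 1},
    {"b": 4, "d": 2},
    {"a": 1, "b": 1, "d": 1},
]


def top_k_high_utility(transactions, profit, k):
    """Return the k itemsets with the highest utility as (utility, itemset) pairs."""
    items = sorted({item for t in transactions for item in t})
    # Transaction utility: total utility of every item in the transaction.
    tu = [sum(q * profit[i] for i, q in t.items()) for t in transactions]
    heap = []  # min-heap holding the current best k (utility, itemset) pairs

    def search(prefix, tids, start):
        # Depth-first walk of the set enumeration tree below `prefix`.
        for pos in range(start, len(items)):
            item = items[pos]
            new_tids = [tid for tid in tids if item in transactions[tid]]
            if not new_tids:
                continue
            itemset = prefix + (item,)
            # TWU bounds the utility of this itemset and all of its supersets.
            twu = sum(tu[tid] for tid in new_tids)
            # The threshold starts at 0 and rises once k itemsets are held.
            threshold = heap[0][0] if len(heap) == k else 0
            if twu < threshold:
                continue  # nothing in this subtree can enter the top k
            utility = sum(
                transactions[tid][i] * profit[i]
                for tid in new_tids
                for i in itemset
            )
            if len(heap) < k:
                heapq.heappush(heap, (utility, itemset))
            elif utility > heap[0][0]:
                heapq.heapreplace(heap, (utility, itemset))
            search(itemset, new_tids, pos + 1)

    search((), list(range(len(transactions))), 0)
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


if __name__ == "__main__":
    print(top_k_high_utility(TRANSACTIONS, PROFIT, 2))
    # prints [(21, ('a', 'd')), (20, ('a',))]
```

The user supplies only k, and the internal threshold rises as better itemsets are found, which is what removes the need to choose a minimum utility threshold in advance. Itemsets tied with the k-th utility are kept or dropped arbitrarily in this sketch.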
Keywords/Search Tags: data mining, big data, Top-k high utility, parallelization