Research On Top-k High Utility Item Set Mining Based On Spark

Posted on:2020-12-05

Degree:Master

Type:Thesis

Country:China

Candidate:Z H He

Full Text:PDF

GTID:2428330590471602

Subject:Electronic and communication engineering

Abstract/Summary:

With the advent of the big data era, every industry is accumulating massive amounts of data. The ability to extract useful information from large data sets quickly and effectively has become a measure of an enterprise's competitiveness. Association rule mining made it possible to mine valuable and interesting knowledge and rules from data efficiently. However, because mining requirements have changed and a suitable threshold is difficult to set during mining, Top-k high utility itemset mining was proposed to make such mining more widely applicable in practice. This thesis studies high utility itemset mining algorithms in both single-machine mode and a distributed cluster environment. The main contributions are as follows.

First, existing Top-k high utility itemset mining algorithms raise the threshold slowly and generate massive candidate sets during iteration, which leads to excessive memory usage. To address this, an improved TKO algorithm based on R-list is proposed. The algorithm uses a data structure called R-list to rapidly access the information stored in the list during itemset mining. It combines an improved RSD threshold-raising strategy to preprocess the data and uses a set enumeration tree to represent the search space. During the recursive search, stricter pruning parameters are applied and the utilities of multiple itemsets are computed simultaneously to narrow the search space. Experimental results on different types of data sets show that the improved algorithm outperforms other Top-k high utility itemset mining algorithms in memory efficiency and remains stable as the value of k changes.

Second, to address the low efficiency and memory overflow of traditional mining algorithms when mining large-scale data in a distributed cluster environment, the improved TKO algorithm is combined with Spark to produce STKO, a parallel high utility itemset mining algorithm based on Spark. STKO runs on the Spark platform and changes the original data storage structure. It uses broadcast to optimize the iterative process and avoid a large amount of recomputation, and it uses load balancing to mine Top-k high utility itemsets in parallel. Experimental results show that STKO can effectively mine high utility itemsets in big data and meets the needs of high utility itemset mining on large data sets in a distributed cluster environment.
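
To make the idea of Top-k mining with a rising threshold more concrete, here is a minimal single-machine sketch in Python. It is not the thesis's algorithm: the abstract does not describe the R-list structure or the RSD strategy, so neither is reproduced. Instead the sketch assumes the standard transaction-weighted utility (TWU) upper bound for pruning and a min-heap that holds the current k best itemsets, with the smallest utility in the heap acting as the internal threshold. The function name, the data layout (each transaction maps an item to its non-negative utility), and the sample data are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# Assumption: each transaction maps an item to its (non-negative) utility
# in that transaction, e.g. quantity * unit profit already multiplied out.
Transaction = Dict[str, int]
Result = Tuple[int, Tuple[str, ...]]


def top_k_high_utility(transactions: List[Transaction], k: int) -> List[Result]:
    """Return the k itemsets with the highest total utility, best first."""
    heap: List[Result] = []   # min-heap of (utility, itemset)
    threshold = 0             # raised once the heap holds k itemsets
    items = sorted({item for t in transactions for item in t})

    def offer(itemset: Tuple[str, ...], utility: int) -> None:
        nonlocal threshold
        if len(heap) < k:
            heapq.heappush(heap, (utility, itemset))
        elif utility > heap[0][0]:
            heapq.heapreplace(heap, (utility, itemset))
        if len(heap) == k:
            threshold = heap[0][0]   # k-th best utility seen so far

    def search(prefix: Tuple[str, ...], candidates: List[str],
               supporting: List[Transaction]) -> None:
        # Depth-first walk of the set enumeration tree.
        for idx, item in enumerate(candidates):
            rows = [t for t in supporting if item in t]
            if not rows:
                continue
            # TWU: total utility of every transaction containing the itemset.
            # It bounds the utility of the itemset and of all its supersets.
            twu = sum(sum(t.values()) for t in rows)
            if twu < threshold:
                continue             # no superset can enter the top k
            itemset = prefix + (item,)
            utility = sum(t[i] for t in rows for i in itemset)
            offer(itemset, utility)
            search(itemset, candidates[idx + 1:], rows)

    search((), items, transactions)
    return sorted(heap, reverse=True)


# Hypothetical sample data for illustration only.
transactions = [
    {"a": 5, "b": 2, "c": 1},
    {"b": 4, "c": 3},
    {"a": 5, "c": 1, "d": 7},
    {"a": 10, "b": 6},
]
result = top_k_high_utility(transactions, 3)
```

Because the threshold starts at zero and only rises as better itemsets are found, the user supplies k rather than a minimum utility value. How quickly the threshold rises determines how much of the search space can be pruned, which is the bottleneck the abstract says the improved algorithm targets.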

Keywords/Search Tags:

data mining, big data, Top-k high utility, parallelization

Related items

1	Mining High-Utility Itemsets Under Various Data Types, Constraints And Applications
2	An Efficient Algorithm For Discovering High Utility Itemsets With Negative Item Values In Large Databases
3	Research On Segmentation And High-Efficiency Itemsets For Data Flow
4	Research And Application Of Concise High Utility Patterns Mining Algorithms Over Data Streams
5	The Research Of High Utility Itemsets Mining Algorithm Over Data Stream
6	Research And Application Of High Utility Pattern Mining Algorithm
7	Research On High-utility Itemset Data Mining Based On Distributed Platform
8	Research On High Utility Patterns Mining Based On Dynamic Indexed Lists
9	Improvement And Application Research Of High Utility Pattern Mining Algorithm Over Data Stream
10	Research On High Utility Pattern Mining Method For Big Data