Font Size: a A A

Research On Accelerating High-utility Itemset Mining Based On Spark

Posted on:2019-11-21Degree:MasterType:Thesis
Country:ChinaCandidate:H Y XiaoFull Text:PDF
GTID:2428330545971550Subject:Engineering
Abstract/Summary:PDF Full Text Request
With the rapid development of the Internet,the era of big data has come,resulting in a huge amount of data that needs to be processed.However,traditional data mining algorithms are generally serial algorithms running on a stand-alone system.Due to hardware resource limitations,these algorithms do not scale well on massive data sets and usually spend a lot of time in the computation.Association rule mining aims to explore potential links and linkages that are difficult to discover directly.The traditional association rules mining is based on the "support-confidence" framework,and the strong association rules generated reflect the correlation between transaction.However,this method focuses on the frequency of transaction and ignores the utility value of transaction.Therefore,in some specific cases,some combinations that are not particularly frequent but are highly effective are lost,so that users are not interested in the mining association rules.High-Utility Itemset Mining(HUIM)is an association rule mining based on utility values(weights).It aims to mine the combinations of utility values satisfying the conditions,thus making up for the deficiencies of traditional association rules mining.It is an important part of data mining sub-domain pattern mining.Has a very wide range of applications in daily life,such as: businesses can tap the customer's history in the supermarket shopping information,to get the highest sales of merchandise portfolio,thus helping users to better make decisions.In addition,High-Utility Itemset Mining can also be applied to many other areas such as medical diagnosis,finance,search engines,smart recommendation systems,and bioinformatics.However,there is also an important problem in the High-Utility Itemset Mining: With the increase of data volume,the running time occupied by High-Utility Itemset Mining exponentially increases.The combination of Spark distributed platform and High-Utility Itemset Mining can solve the problem of long running time issue of High-Utility Itemset Mining.Existing research report that EFIM and d2 hup algorithms are the two best performing algorithms in the field of High-Utility Itemset Mining.These two algorithms perform differently on different data sets.Out work first introduces the related concepts of EFIM and d2 hup,then design algorithms to parallelize them.We design and implement a distributed version of the EFIM algorithm and the d2 hup algorithm,respectively.The d2 hup algorithm is divided into two parts based on the data partitioning and the partitioning based on the structure.It is concluded that the d2 hup algorithm based on data partitioning has a significant increase in running time under some data sets,but the limitation is larger.Finally,we test the Spark-based EFIM algorithm and d2 hup algorithm through 11 different data sets.In the experiment,we study the influence of different parameters such as partition numbers,executors and executor-cores on the efficiency of the algorithms,and accelerate on the effects.We conclude that Sparkbased parallel d2 hup and EFIM algorithms have a greater improvement in running time efficiency than serial algorithms.State-of-the-art research has shown that EFIM and d2 hup algorithm are the two best algorithms for highutility itemsets mining.EFIM and d2 hup are the fastest algorithms on different data sets.But given a data set,can we predict which of them runs faster,without running two algorithms? There is currently no automatic selection of the fastest HUIM algorithm.This paper generates 118 data sets by generating and collecting the running time of the two algorithms in the real and simulated data sets,taking into account the characteristics of each data set's length,sparse degree,and data set size as characteristics,with running time as the prediction target,then establishing a model.The accuracy of the prediction was evaluated through experiments and a set of rules based on decision trees was generated.According to the rules,the fastest algorithm between EFIM and d2 hup can be predicted very well.
Keywords/Search Tags:Data mining, Spark, High-Utility Itemset Mining, d2hup, EFIM
PDF Full Text Request
Related items