
High Frequency And Low Utility Pattern Mining Algorithm And Its Implementation On Cloud Computing

Posted on: 2019-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z M Chang
GTID: 2348330542481608
Subject: Engineering
Abstract/Summary:
Pattern mining is an important research direction in data mining. Traditional frequent pattern mining finds only frequent patterns, and traditional high utility pattern mining finds only high utility patterns. In many practical applications the scope of these single-criterion models is too narrow to support diversified analysis, because users are often interested in frequency and utility together rather than in either one alone. To address this, this paper considers support and utility at the same time in order to mine more valuable patterns. One such pattern type is the high frequency and low utility pattern, for which this paper proposes a novel algorithm called HFLUP (High Frequency and Low Utility Patterns Mining Algorithm).

The simplest and most direct way to mine high frequency and low utility patterns is to work in two phases. First, a frequent pattern mining algorithm finds all high frequency patterns. Then the patterns whose utilities are less than the maximum utility threshold are selected from them, which yields the high frequency and low utility patterns. However, this two-phase method generates a large number of candidates and must traverse the database many times, so disk I/O is expensive and mining efficiency is low. To avoid these problems, HFLUP is a single-phase algorithm that generates no candidates and traverses the database only twice. This paper also proposes a novel structure, called FUL, that stores both the utility information of a pattern and the information needed to prune the search space. HFLUP mines high frequency and low utility patterns efficiently and directly from FULs without generating candidates. To reduce the search space and improve mining efficiency, two pruning strategies are proposed: an efficient utility lower-bound pruning strategy, and a lookahead strategy that identifies high frequency and low utility patterns without enumerating them. Extensive experiments show that both pruning strategies are effective and that HFLUP outperforms the two-phase method in both running time and memory consumption.

The second task of this paper is to parallelize the proposed algorithm so that it can process massive data and overcome the inefficiency of single-machine mining caused by the limited physical memory of one machine. Spark, a memory-based parallel computing framework for cloud computing, is used to parallelize the algorithm, and a Spark-based parallel high frequency and low utility pattern mining algorithm, PHFLUPS, is proposed so that large-scale distributed clusters can mine large data in parallel. Comparative experiments show that PHFLUPS is more efficient than a parallel high frequency and low utility pattern mining algorithm based on MapReduce, and that the parallel algorithms outperform the non-parallel HFLUP algorithm on large-scale datasets. The ideas and techniques proposed in this paper are also applicable to mining other types of patterns, such as low frequency and high utility patterns.
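
The two-phase baseline described in the abstract can be illustrated with a small sketch. This is not the HFLUP algorithm or the FUL structure, whose details the abstract does not give. It is only a naive version of the baseline that HFLUP is compared against, under stated assumptions: each transaction is a mapping from item to purchased quantity, the utility of an itemset in a transaction is the sum of quantity times unit profit over its items, and the names `two_phase_baseline`, `unit_profit`, `min_support`, and `max_utility` are hypothetical.

```python
from itertools import combinations


def two_phase_baseline(transactions, unit_profit, min_support, max_utility):
    """Naive two-phase sketch: frequent itemsets first, then a utility filter.

    transactions: list of dicts mapping item -> quantity (assumed data model).
    unit_profit:  dict mapping item -> profit per unit (assumed data model).
    """
    # Phase 1: level-wise (Apriori-style) search for itemsets whose
    # support (number of containing transactions) is >= min_support.
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    level = [frozenset([item]) for item in items]
    while level:
        k = len(level[0]) + 1
        survivors = []
        for itemset in level:
            support = sum(1 for t in transactions if itemset.issubset(t))
            if support >= min_support:
                frequent[itemset] = support
                survivors.append(itemset)
        # Join frequent itemsets that differ in exactly one item to
        # build the candidates of the next level.
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == k})

    # Phase 2: rescan the data and keep the frequent itemsets whose
    # total utility is less than the maximum utility threshold.
    result = {}
    for itemset, support in frequent.items():
        utility = sum(
            sum(t[item] * unit_profit[item] for item in itemset)
            for t in transactions
            if itemset.issubset(t)
        )
        if utility < max_utility:
            result[itemset] = (support, utility)
    return result


# Toy data, invented purely for illustration.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
    {"b": 4},
]
unit_profit = {"a": 1, "b": 2, "c": 10}
result = two_phase_baseline(transactions, unit_profit,
                            min_support=2, max_utility=10)
```

Even on this toy input the sketch rescans all transactions once per candidate in phase 1 and again per frequent itemset in phase 2, which is the repeated database traversal and candidate generation that the abstract identifies as the cost of the two-phase method.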
Keywords/Search Tags:data mining, high frequency and low utility patterns, pattern mining, big data, MapReduce, Spark