Font Size: a A A

Research On Optimization And Parallelization Of Top-rank-k Frequent Pattern Mining Algorithm

Posted on:2021-02-06Degree:MasterType:Thesis
Country:ChinaCandidate:Y H LongFull Text:PDF
GTID:2428330611960829Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Data mining is one of the leading research directions in the field of database and information decision-making.Top-rank-k frequent pattern mining is a method of mining frequent patterns with rank not greater than k in data mining.Its emergence solves the problem of difficulty in setting support threshold.However,the efficiency of the mainstream top-rank-k frequent pattern mining algorithm needs to be improved,and the top-rank-k frequent pattern mining algorithm designed in serial is difficult to deal with the massive data mining task in the era of big data because it cannot break through the hardware resource limit of single machine.Therefore,it is necessary and meaningful to study the optimization method and parallel strategy of top-rank-k frequent pattern mining algorithm.The main work of this article is as follows:(1)Aiming at the problem of high space-time cost of current top-rank-k frequent pattern mining,a hybrid-search-based algorithm HTK(Hybrid-search-based Algorithm of Top-rank-k Frequent Patterns)of top-rank-k is proposed.Its basic idea is: Define a static double-linked list named RSL(Static Doubly-linked List of Top-rank-k)to store top-rank-k frequent patterns,and use the support of 1-pattern and its number of suffix items in the transaction to design a pattern discrimination method to distinguish patterns into short and long patterns.In the mining,firstly,short frequent patterns are mined using the rank-first-search method based on the greedy strategy,which always joins the highest rank patterns in the RSL that are not yet joined;Then,according to the shortest pattern with the longest length,long frequent patterns are mined by level-wise-search.In addition,some reasonable pruning strategies are also designed,and the algorithm also optimizes the generation of the subsume index,making the mining work equally effective for sparse or dense datasets.(2)In order to deal with the massive data mining task,STK(Spark-based Mining Algorithm of Top-rank-k Frequent Patterns),a spark-based mining algorithm of top-rank-k frequent patterns is proposed,the basic idea is: Firstly,the idea of divide-and-conquer is introduced,the items are grouped and sent to each node,the transactions are cut according to the items to which each node belong;Then,each node execute the HTK algorithm to mine the top-rank-k frequent patterns whose prefix element is the item to which the node belongs.Finally,aggregate the mining results of each node to output target patterns.In order to make full use of computing resources,STK also uses a load-balanced grouping method.In this study,real data set and synthetic data set are used to verify the proposed HTK and STK.The results show that HTK algorithm has better space-time efficiency than the existing single machine algorithm,and STK algorithm can effectively mine top-rank-k frequent patterns in big data.
Keywords/Search Tags:Big Data, Parallelization, Spark, Top-rank-k Frequent Pattern, Hybrid Search
PDF Full Text Request
Related items