Mining Frequent Sequential Patterns Based On Hierarchical Tree Under Massive Data

Posted on: 2019-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
Full Text: PDF
GTID: 2428330566998099
Subject: Computer Science and Technology
Abstract/Summary:
Frequent sequential pattern mining has long been used in practical scenarios, where it supports production and sales decisions for merchants and companies. As data acquisition and storage capabilities have improved, however, the volume of data available for frequent sequential pattern mining in real settings has grown drastically. More frequent sequences can be mined from massive data, but traditional frequent sequence mining algorithms are far too inefficient to meet practical requirements at that scale. In addition, the elements of most real data sets carry practical meaning, and each has its own category information. The category information of all elements can be combined into a hierarchical tree. Conventional frequent sequence mining algorithms can only mine patterns made up of elements that appear in the data set; with a hierarchical tree, we can obtain more generalized frequent sequences that traditional algorithms cannot mine.

Existing hierarchy-based frequent sequence mining algorithms leave considerable room for improvement in mining efficiency. Their mining results also contain redundancy. Some studies mention this problem, but none defines redundant results precisely, and none gives a solution. Furthermore, when mining frequent sequence patterns, especially on massive data with a hierarchical tree, the number of sequences in the result is extremely large, while the user may be interested only in the sequences that match a specific pattern. Constraints such as the maximum interval constraint, the maximum sequence length constraint, and regular expression constraints are therefore needed to restrict the results to a certain range. Regular expression constraints let the algorithm mine only results that involve specific content. However, no existing research integrates regular expression constraints into distributed, hierarchy-based data mining algorithms for massive data.

We propose the framework RUMMAGE to solve these problems. RUMMAGE is divided into four phases: preprocessing, Map, Reduce, and Cleanup. In the Map phase, we propose PUT, a more efficient projection algorithm based on the projection algorithm of LASH. In the Reduce phase, we first propose MINE, an algorithm based on the PSM algorithm that avoids redundant operations; we then define RE-Hierarchy, a regular expression applicable to the hierarchical tree, and propose REC-MINE, which mines frequent sequences that match regular expression constraints based on hierarchical trees under massive data. Finally, in the Cleanup phase, we propose REI, which efficiently eliminates redundant mining results and greatly reduces the number of result sequences.
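
The idea of generalized frequent sequences can be illustrated with a minimal Python sketch. It is not RUMMAGE, PUT, MINE, or any other algorithm from the thesis; the item names, the `PARENT` map, the toy `DATABASE`, and the helper functions are all hypothetical and only show how a hierarchical tree lets a pattern be supported by sequences that share no concrete items.

```python
# Hypothetical toy hierarchy (an assumption for illustration): child -> parent.
# Items without an entry are roots of the hierarchical tree.
PARENT = {
    "cola": "drink",
    "juice": "drink",
    "chips": "snack",
    "nuts": "snack",
}


def ancestors_or_self(item):
    """Return the item together with all of its ancestors in the hierarchy."""
    result = {item}
    while item in PARENT:
        item = PARENT[item]
        result.add(item)
    return result


def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a generalized subsequence.

    A pattern element matches a data item when it equals the item or one
    of the item's ancestors. Order is preserved; gaps are unrestricted.
    """
    pos = 0
    for item in sequence:
        if pos < len(pattern) and pattern[pos] in ancestors_or_self(item):
            pos += 1
    return pos == len(pattern)


def support(database, pattern):
    """Count the sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database)


# Hypothetical toy sequence database.
DATABASE = [
    ["cola", "chips"],
    ["juice", "nuts"],
    ["cola", "nuts"],
    ["chips", "juice"],
]

print(support(DATABASE, ["cola", "chips"]))   # item-level pattern
print(support(DATABASE, ["drink", "snack"]))  # generalized pattern
```

In this toy data the item-level pattern cola, chips is supported by one sequence, while the generalized pattern drink, snack is supported by three, so under a minimum support of 2 only the generalized pattern would be reported.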
Keywords/Search Tags:sequence pattern, distributed, hierarchical tree, redundancy elimination, regular expression constraints