A Frequent Serial Episode Mining Algorithm With Time Constraints Based On Spark Platform

Posted on: 2020-12-14 | Degree: Master | Type: Thesis
Country: China | Candidate: S Z Peng | Full Text: PDF
GTID: 2428330602950576 | Subject: Computer Science and Technology
Abstract/Summary:
Counting the frequency of serial episodes in a streaming sequence is a pattern mining problem that has drawn continuous attention in academia because of its wide practical application. Although a number of serial episode mining algorithms have been developed recently, most of them are neither stream-oriented, since they require multiple passes over the dataset, nor time-aware, since they do not take the time constraint of serial episodes into account. Among the algorithms that count the frequency of occurrence of each sequence pattern, one named ONCE can compute the frequency of given episodes that satisfy a predefined time constraint as the signals in a stream arrive one after another. However, ONCE applies only when the sequence patterns do not intersect; when the patterns share some intersection, its results are inaccurate. In this thesis, we modify the ONCE algorithm so that it produces accurate and unambiguous results when sequence patterns intersect.

With the growth of the Internet, telecommunications, industrial systems and similar sources, event sequences are produced in very large volumes. On one hand, because the number of events is huge, analyzing the sequences is time-consuming, so the processing algorithm must be efficient and must support parallel computation. On the other hand, because streaming data is unbounded and generated at a non-uniform rate, the statistics computed over it must be updated and stored dynamically, and this must be equally efficient: first, because unbounded data cannot be stored in limited space, and second, because low processing efficiency may cause congestion and data loss. In addition, the sequence patterns to be mined under these requirements must satisfy the time constraint, which most current algorithms cannot handle.

To adapt ONCE to the current data explosion, we also present two extended models, Spark ONCE and Streaming ONCE, both built on ONCE. For Spark, among the most efficient big-data processing frameworks at present, the computation of the ONCE algorithm has been adapted accordingly. Spark ONCE can compute statistics over dense streams of signals and over large-scale data with little time and space, and can process millions of signals per second. Streaming ONCE handles the sequence mining problem for streaming data. After discussing the algorithms that count the frequency of occurrence of each sequence pattern, we discuss how to combine Spark ONCE with the Apriori algorithm and the FP-growth algorithm to perform efficient sequential pattern mining on massive data. For Streaming ONCE, we discuss how to use a tilted-time window to update and store the statistics dynamically, so that unbounded data can be summarized in limited space while processing time remains stable and the statistics remain correct.

In summary, this thesis mainly modifies the existing ONCE algorithm so that it accurately counts the frequency of occurrence when sequence patterns intersect, and then proposes the Spark ONCE and Streaming ONCE algorithms to meet the needs of time-constrained frequent sequence mining on massive data and streaming data.
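To make the counting task concrete, the following is a minimal Python sketch of single-pass, time-constrained counting for one serial episode. It is not the ONCE algorithm from the thesis, whose details are not given in this abstract. The class name `EpisodeCounter`, the helper `count_episode`, the choice to count non-overlapping occurrences, and the definition of the time constraint as "last event minus first event is at most `max_span`" are all assumptions made for illustration. Events are assumed to arrive in increasing timestamp order.

```python
from typing import Hashable, Iterable, List, Optional, Sequence, Tuple


class EpisodeCounter:
    """One-pass counter for a single serial episode (illustrative sketch only).

    Assumption: an occurrence is valid when its events appear in episode
    order and the time from its first to its last event is <= max_span.
    Occurrences are counted greedily and without overlap.
    """

    def __init__(self, episode: Sequence[Hashable], max_span: float) -> None:
        if not episode:
            raise ValueError("episode must contain at least one event type")
        self.episode = list(episode)
        self.max_span = max_span
        self.count = 0
        # _starts[i] is the latest start time of a partial match of
        # episode[:i + 1] seen so far, or None if there is none yet.
        self._starts: List[Optional[float]] = [None] * len(self.episode)

    def update(self, event: Hashable, timestamp: float) -> None:
        """Consume one event from the stream."""
        k = len(self.episode)
        # Walk positions from last to first so that a single event can
        # never extend a partial match that it has just created itself.
        for i in range(k - 1, -1, -1):
            if event != self.episode[i]:
                continue
            if i == 0:
                start = timestamp  # this event opens a new partial match
            else:
                start = self._starts[i - 1]
                if start is None or timestamp - start > self.max_span:
                    continue  # no usable prefix, or the prefix is too old
            if i == k - 1:
                # Full occurrence within the time constraint: count it and
                # reset so that counted occurrences do not overlap.
                self.count += 1
                self._starts = [None] * k
                return
            self._starts[i] = start


def count_episode(stream: Iterable[Tuple[Hashable, float]],
                  episode: Sequence[Hashable],
                  max_span: float) -> int:
    """Count occurrences of `episode` in a stream of (event, timestamp) pairs."""
    counter = EpisodeCounter(episode, max_span)
    for event, timestamp in stream:
        counter.update(event, timestamp)
    return counter.count


if __name__ == "__main__":
    stream = [("A", 1), ("B", 2), ("C", 3),      # valid, span 2
              ("A", 10), ("B", 12), ("C", 20),   # span 10, too long
              ("A", 21), ("B", 22), ("A", 23), ("C", 24)]  # valid, span 3
    print(count_episode(stream, ("A", "B", "C"), max_span=5))
```

The state kept per episode is one timestamp per position, so memory does not grow with the length of the stream. Counting several episodes that share event types, which is the intersecting case the thesis addresses, would need additional bookkeeping that this sketch does not attempt.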
Keywords/Search Tags:Spark, Sequence Mining, Massive Data, Streaming Data