Research On SPARK Based Massive Data Frequent Pattern Mining Algorithms

Posted on: 2017-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y D Zhao
Full Text: PDF
GTID: 2308330503487183
Subject: Computer Science and Technology
Abstract/Summary:
Frequent pattern mining aims to find content that appears often in data sets, and it is one of the most important research directions in data mining. Depending on the type of data set, there are two kinds of frequent patterns: frequent itemsets and frequent subsequences. Because mining frequent patterns consumes substantial computing resources and data sets keep growing, distributed computing frameworks are needed to keep mining effective. The first part of this thesis focuses on mining frequent itemsets in transaction data sets and studies frequent itemset mining algorithms based on the distributed computing framework Spark. We first design and implement Spark versions of the classic algorithms Apriori and FP-Growth. We then propose a two-phase frequent itemset mining algorithm based on Spark that combines features of both FP-Growth and Apriori. Through experiments, we identify the advantages and disadvantages of these algorithms and summarize the scenarios to which each applies. These algorithms make full use of cluster resources and meet the need to mine frequent itemsets rapidly on large data sets. This part also describes how the ideas behind frequent itemset mining can be used to mine frequent subsequences in sequence data sets on Spark.

Beyond mining frequent patterns on Spark, the second part of this thesis focuses on time series compression, as a step toward mining frequent patterns in numeric time series data sets. Compressing a time series reduces not only the amount of data but also the noise. Less noise makes the trends of the time series much clearer and helps in mining meaningful frequent patterns. Starting from perceptually important points (PIPs) and extending earlier work, we design and implement two PIP-based time series compression algorithms: one based on global PIPs and one based on local PIPs. The two algorithms suit different kinds of time series, and we measure their effectiveness and degree of distortion through experiments. Visualization is an important requirement when working with time series. Because PIP-based compression algorithms preserve the trend information of a time series, the compressed series remain well suited to visualization.
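
To make the idea of PIP-based compression concrete, the following is a minimal sketch of the classic PIP selection procedure, not the global or local PIP algorithms designed in the thesis, whose details the abstract does not give. It assumes evenly spaced samples and uses vertical distance to the line joining neighbouring PIPs as the importance measure. The function names `vertical_distance`, `select_pips`, and `compress` are hypothetical and chosen only for illustration.

```python
from bisect import insort
from typing import List, Sequence, Tuple


def vertical_distance(series: Sequence[float], left: int, right: int, i: int) -> float:
    """Vertical distance from point i to the line joining points left and right."""
    y_left, y_right = series[left], series[right]
    # Value of the connecting line at position i (samples assumed evenly spaced).
    y_line = y_left + (y_right - y_left) * (i - left) / (right - left)
    return abs(series[i] - y_line)


def select_pips(series: Sequence[float], k: int) -> List[int]:
    """Return the sorted indices of k perceptually important points.

    Assumption: this is the textbook greedy procedure. It starts with the
    two end points and repeatedly adds the point that lies farthest
    (vertically) from the line between its two neighbouring PIPs.
    """
    n = len(series)
    if n == 0:
        return []
    if k >= n:
        return list(range(n))
    if k < 2:
        raise ValueError("k must be at least 2 to keep both end points")

    pips = [0, n - 1]
    while len(pips) < k:
        best_idx, best_dist = -1, -1.0
        # Scan every gap between adjacent PIPs for the most distant point.
        for left, right in zip(pips, pips[1:]):
            for i in range(left + 1, right):
                d = vertical_distance(series, left, right, i)
                if d > best_dist:
                    best_idx, best_dist = i, d
        insort(pips, best_idx)  # keep the index list sorted
    return pips


def compress(series: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Compress a series to k (index, value) pairs located at its PIPs."""
    return [(i, series[i]) for i in select_pips(series, k)]
```

For the series `[0, 1, 0, 5, 0, 1, 0]` with `k = 3`, this sketch keeps indices 0, 3, and 6, so the dominant peak survives while the small bumps are dropped. This rescans every gap on each iteration, which is simple but quadratic in the worst case; a practical implementation would cache the best candidate of each gap.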
Keywords/Search Tags: frequent patterns, Spark, time series compression, perceptually important point