
Research On Distributed Frequent Itemset Mining Algorithm Based On Spark

Posted on: 2018-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: S Z Chen
GTID: 2348330536452513
Subject: Software engineering
Abstract/Summary:
Since the 1980s, the rapid development of database and information technology has steadily increased the number of large databases. With the arrival of the Internet era, many industries generate massive amounts of data. How to exploit this data automatically and resolve the problem of being "data rich, knowledge poor" has become urgent, and data mining technology emerged against this background. Frequent itemset mining is an important research topic in data mining. It is the foundation of association rule mining, correlation analysis, regression analysis, series analysis, local periodicity, episode fragments and other important data mining tasks. With the advent of the Big Data era, extracting useful information quickly from huge volumes of data has become very important. In recent years, Apache Spark, a fast and general engine for large-scale data processing, has provided a new way to analyze massive data efficiently. This thesis takes advantage of Spark to design a distributed algorithm for mining frequent itemsets. The research covers the following tasks.

First, frequent itemset mining algorithms based on multiprocessor systems and Hadoop clusters suffer from high communication load. We use a partition strategy to transform the original data so that the data on each node is independent, which lets the nodes mine frequent itemsets in parallel without inter-node communication.

Second, traditional frequent itemset mining algorithms suffer from load imbalance. By partitioning the original data set and distributing the tasks, the algorithm assigns work to the computation nodes of the cluster in a reasonable way, which gives it load balancing.

Third, frequent itemset mining algorithms based on multiprocessor systems have no fault tolerance, and those based on Hadoop clusters are poorly suited to iterative computation and incur heavy disk I/O cost. We therefore build the distributed frequent itemset mining algorithm on Apache Spark, which gives it good efficiency, scalability, load balancing and fault tolerance.

Fourth, the DFPS algorithm proposed in this thesis may have a low degree of parallelism and fail to use the full computing capacity of the cluster when the cluster is very large. We present two optimization strategies for DFPS: a user-defined strategy and a cluster-adaptive strategy. Both split tasks into subtasks, which raises the parallelism of DFPS, makes fuller use of the cluster's computing power, and makes the algorithm more efficient.

Finally, to verify the practicability and performance of DFPS, we apply it in the project "Research on Constructing Big Data Platform and Studying Big Data Skills Based on SAP Technology". The project includes designing a Big Data platform, combining the HANA database with the R language, and conducting research on data mining. In the course of this project, we verify the practicability and mining efficiency of DFPS in a real setting.
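The abstract does not spell out how the partition strategy of DFPS works. As a rough illustration of how partitioning can make workers data independent, the sketch below uses the group-dependent projection familiar from parallel FP-growth, written in plain Python rather than Spark. This is an assumption made for illustration, not the thesis's method: the function names `partition_transactions`, `mine_shard` and `mine_distributed` are hypothetical, and the local miner is a brute-force enumerator standing in for FP-growth.

```python
from collections import Counter, defaultdict
from itertools import combinations


def partition_transactions(transactions, min_support, num_groups):
    """Project each transaction into one shard per item group (assumed scheme)."""
    # Count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    # Keep frequent items, ordered by descending support, ties broken by name.
    frequent = sorted((i for i, c in counts.items() if c >= min_support),
                      key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(frequent)}
    # Assign items to groups round-robin; each group is owned by one worker.
    group_of = {item: r % num_groups for item, r in rank.items()}

    shards = defaultdict(list)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        seen = set()
        # Walk from the least frequent item backwards. The first time a group
        # appears, send the prefix ending at that item to the group's shard.
        for pos in range(len(ordered) - 1, -1, -1):
            g = group_of[ordered[pos]]
            if g not in seen:
                seen.add(g)
                shards[g].append(ordered[:pos + 1])
    return shards, group_of


def mine_shard(shard, group, group_of, min_support):
    """Mine one shard locally, reporting only itemsets that end in this group.

    Brute-force enumeration (exponential in transaction length); a real
    implementation would build a local FP-tree instead.
    """
    counts = Counter()
    for t in shard:
        for size in range(1, len(t) + 1):
            for combo in combinations(t, size):
                if group_of[combo[-1]] == group:
                    counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


def mine_distributed(transactions, min_support, num_groups):
    """Partition, mine every shard independently, and merge the results."""
    shards, group_of = partition_transactions(transactions, min_support, num_groups)
    result = {}
    for group, shard in shards.items():  # each iteration could run on its own worker
        result.update(mine_shard(shard, group, group_of, min_support))
    return result


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
    for itemset, support in sorted(mine_distributed(data, 2, 2).items()):
        print(itemset, support)
```

Because every itemset is counted only in the shard that owns its least frequent item, and that shard receives every transaction prefix containing the itemset, the per-shard results are disjoint and can be merged without any exchange between workers. In Spark, the shards would typically be the values of a grouped pair RDD, with `mine_shard` applied to each group.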
Keywords/Search Tags:frequent itemset mining, association rules mining, FP-growth, Spark, RDD, Big Data, distributed algorithm