
Research And Application Of Association Rules Mining Algorithm Based On MapReduce

Posted on: 2019-04-06  Degree: Master  Type: Thesis
Country: China  Candidate: Q Li  Full Text: PDF
GTID: 2348330545458485  Subject: Computer technology
Abstract/Summary:
With the spread and development of computer software, hardware, and the Internet, large amounts of data from all walks of life are being recorded and stored, and the volume is growing explosively. As data volume grows and data content becomes more comprehensive, we can learn users' behavior habits, values, and other important information that traditional data does not provide. This information and knowledge can in turn guide companies and manufacturers to improve their offerings and provide higher-quality service, and so achieve higher returns. Massive data sets therefore hold great hidden value that needs to be mined, and data mining has become a focus of research. Association rules mining, an important part of data mining, has received increasing attention.

To address the inefficiency of obtaining frequent patterns in conventional static data mining, this thesis focuses on optimizing and improving association rules algorithms. First, it briefly introduces data mining concepts and techniques and the basics of association rules, including the main algorithms and their steps. It then describes in detail the classical Apriori algorithm for association rules mining and the popular association rules mining algorithm based on a compressed matrix. After analyzing and discussing the problems of both, we propose an improved and optimized algorithm, MAR-DPS. MAR-DPS applies a series of pruning strategies to minimize the generation of candidate sets, and it chooses among different ways of generating frequent 2-itemsets according to the characteristics of the data set, so as to save as much time as possible. In the experiments, we use three data sets to verify the performance of MAR-DPS.

Because data mining must now handle data volumes dozens of times larger than before, or more, existing single-node mining methods can no longer meet our requirements for execution time and efficiency. Parallel computing is therefore an option worth trying. At present, the mature and popular distributed frameworks for parallel computing are mainly Apache Hadoop and Apache Spark. The two frameworks have different characteristics: Hadoop is well suited to offline data processing and to scenarios that do not require multiple iterations, while Spark's memory-based computing model is better suited to iterative computing. Spark also provides many operators that let users focus on the task rather than on the code itself. After comparing the two frameworks, we choose Spark as the distributed platform and try to migrate the MAR-DPS algorithm to it, in order to ease the difficulties and pressure caused by massive data and to search for association rules in massive data efficiently.
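For readers unfamiliar with the classical Apriori algorithm mentioned above, the following is a minimal single-node sketch in Python of its level-wise search: count frequent 1-itemsets, join them into larger candidates, prune any candidate that has an infrequent subset, and count again. It is not the MAR-DPS algorithm, whose pruning strategies and 2-itemset generation are not described in this abstract. The function name `apriori`, the `min_support` parameter (an absolute transaction count), and the sample transactions are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset.

    `min_support` is an absolute count of transactions (an assumption of
    this sketch; relative thresholds are equally common).
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items and keep the frequent ones.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one scan of the data per level.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data, used only to exercise the sketch.
sample = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent_itemsets = apriori(sample, min_support=3)
```

Each level requires a full scan of the transactions, and the number of candidates can grow quickly, which is the kind of cost that pruning strategies and parallel execution aim to reduce.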
Keywords/Search Tags: Data Mining, Association Rules Mining, Apriori Algorithm, Deep Pruning Strategies, MAR-DPS