Research on a Parallelized Distributed Association Rule Mining Algorithm Based on Hadoop

Posted on: 2018-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yu
Full Text: PDF
GTID: 2348330515996657
Subject: Engineering
Abstract/Summary:
With the rapid development of science and technology in recent years, people's everyday activities on computers, mobile phones and other terminal platforms produce large amounts of data. Both the volume of data and the volume of access to it are growing rapidly. In the big data era, all kinds of data are growing faster than expected, and the business data generated each day by a large-scale network can reach hundreds of terabytes or even petabytes. How to obtain information from such large databases quickly, efficiently and accurately is one of the hot topics in computer science research. Parallelized distributed data mining is an important method for analyzing massive data that may be spread across regions, and it has significant research and practical value.

Association rule mining is one of the classical data mining algorithms and has strong learning and reference value. When parallelized, the traditional association rule mining algorithm exchanges candidate sets, which causes a large amount of network traffic. Moreover, with large amounts of data the number of candidate itemsets grows sharply, which can easily overload a machine's memory and reduce the efficiency of the algorithm. This thesis proposes an optimized algorithm, Y-IDA, which merges the counts directly in memory instead of outputting the candidate sets one by one as the traditional method does. At the same time, it modifies the Hadoop interface to change the MapReduce read model and uses the frequent 1-itemsets to clean the database, reducing memory consumption and CPU time and improving the execution efficiency of the algorithm.

The main work of this thesis includes:
1) Implementing the basic serial Apriori algorithm, which lays the foundation for the subsequent parallelization.
2) Proposing the optimized algorithm Y-IDA for the parallel Apriori algorithm. The algorithm merges the counts in memory, replacing the traditional method of outputting the candidate sets one by one; changes the traditional read mode of MapReduce, reducing the amount of traffic during execution; and cleans the data using the 1-itemsets, removing invalid data.
3) Implementing the association rule algorithm on the Hadoop platform. Under the existing experimental conditions, an experimental scheme is designed to verify that Y-IDA produces the same results as the classical algorithm, and the two are compared in detail in terms of time efficiency, memory consumption, disk reads and writes, CPU occupancy and other aspects.

Experiments on a fully distributed Hadoop platform with discrete data mining test data show that the improved algorithm shortens the execution time and performs better in memory consumption, CPU occupancy and disk I/O. The conclusion is that the improved algorithm is feasible and generally applicable.
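As an illustration only, the two ideas named in the abstract (cleaning the transactions with the frequent 1-itemsets, and merging counts in memory before emitting them) can be sketched in a few lines of Python. The abstract does not give the details of Y-IDA, so this is not the thesis's implementation: the function names, the toy transactions and the threshold are all hypothetical, plain Python lists stand in for Hadoop input splits, and only the 2-itemset pass is shown.

```python
from collections import Counter
from itertools import combinations


def frequent_items(transactions, min_count):
    """Pass 1: count single items and keep those meeting the support threshold."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, n in counts.items() if n >= min_count}


def clean(transactions, keep):
    """Drop infrequent items, then drop transactions too short to contain a pair."""
    cleaned = (sorted(set(t) & keep) for t in transactions)
    return [t for t in cleaned if len(t) >= 2]


def count_pairs_in_split(split):
    """Map side (assumed): merge pair counts in memory for one split,
    so each distinct pair is emitted once with its local total."""
    local = Counter()
    for t in split:
        local.update(combinations(t, 2))
    return local


def merge_counts(partials, min_count):
    """Reduce side (assumed): sum the per-split counters and apply the threshold."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {pair: n for pair, n in total.items() if n >= min_count}


# Hypothetical toy data: five transactions, minimum support count of 2.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c", "d"],
    ["b", "c"],
    ["a", "b", "c", "e"],
]
min_count = 2

keep = frequent_items(transactions, min_count)
cleaned = clean(transactions, keep)
splits = [cleaned[:3], cleaned[3:]]  # stand-in for two input splits
partials = [count_pairs_in_split(s) for s in splits]
frequent_pairs = merge_counts(partials, min_count)
print(frequent_pairs)
```

In this toy run the first split contains five pair occurrences but emits only three merged records, which is the kind of traffic reduction the abstract attributes to in-memory merging. Because the transactions are first restricted to frequent single items, every pair counted here is a valid Apriori candidate 2-itemset.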
Keywords/Search Tags:Association rule mining algorithm, Parallelized data mining, Apriori, Hadoop