
Research On Algorithm And Application Of Big Data Association Rules Mining Based On Hadoop

Posted on: 2020-11-05
Degree: Master
Type: Thesis
Country: China
Candidate: N Liu
Full Text: PDF
GTID: 2428330620462232
Subject: Information and Communication Engineering
Abstract/Summary:
Data mining is the process of extracting useful and interesting knowledge from data. Association rule mining is one of its main tasks; its purpose is to discover implicit associations between transaction items. In the era of big data, traditional single-machine association rule mining algorithms can no longer meet demand: mining takes too long, and the data to be mined does not fit in memory. These problems have driven research into big data association rule mining algorithms. Most current algorithms of this kind are built on MapReduce and Hadoop, but their core is still a single-machine association rule mining algorithm, so their performance still depends on the performance of that algorithm. This paper therefore studies an efficient association rule mining algorithm, PrePost, and analyzes its problems in detail. On this basis, an improved algorithm, Prune-PrePost, is proposed. A parallel algorithm based on MapReduce, MRPrune-PrePost, is then proposed and applied to mining rules of landslide deformation. The main work of this paper is as follows:

(1) PrePost, an efficient association rule mining algorithm, is studied and its problems are analyzed in detail. PrePost mines frequent itemsets by intersecting N-lists and has proved to be efficient. The analysis shows two problems: mining frequent 2-itemsets takes a very long time, and pruning is insufficient, so a large number of candidates still need to have their frequency verified.

(2) An improved algorithm, Prune-PrePost, is proposed to address these problems. First, frequent 2-itemsets are mined by a method of “determining frequent itemsets, then seeking itemset-related information”. The whole process neither generates candidate itemsets nor requires subsequent verification of their frequency, which improves performance in mining frequent 2-itemsets. Second, a pruning strategy that prunes more of the itemset search space is proposed. Prune-PrePost uses a set-enumeration tree to represent the search space, and the pruning strategy is based on the equivalence property of the supersets of an itemset. This strategy trims more of the frequent-itemset search space and thereby improves the overall performance of the algorithm. This paper implements Prune-PrePost and verifies its performance through extensive experiments.

(3) A parallel algorithm based on MapReduce, MRPrune-PrePost, is proposed. It parallelizes Prune-PrePost on MapReduce and applies a load-balancing data grouping strategy, which better balances the load across the nodes of the Hadoop cluster and improves the overall performance of the parallel algorithm. In addition, to make it easier for users to find the information they are interested in within the final mining results, this paper proposes a method for outputting the top-K frequent itemsets. This paper builds a distributed Hadoop cluster, implements MRPrune-PrePost, and verifies its performance through extensive experiments.

(4) Using MRPrune-PrePost together with massive monitoring data, the response of landslide deformation to its inducing factors is studied. Based on monitoring data from the Baishuihe landslide in the Three Gorges Reservoir area and other environmental data, the response rules of the Baishuihe landslide to rainfall and reservoir water level are mined with the MRPrune-PrePost algorithm, and a series of useful rules is obtained.
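To make the notions of frequent 2-itemsets and top-K output concrete, here is a minimal Python sketch. It is not the PrePost or Prune-PrePost algorithm: it does not use N-lists, a set-enumeration tree, or MapReduce, and simply counts support by scanning the transactions directly. The function names, the item labels, and the toy transactions are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
import heapq


def frequent_itemsets_up_to_2(transactions, min_support):
    """Return {itemset tuple: support} for frequent 1- and 2-itemsets.

    Assumption: support is an absolute transaction count, and counting is
    done by plain scanning (not by N-list intersection as in PrePost).
    """
    # First pass: support of single items.
    item_counts = Counter()
    for t in transactions:
        item_counts.update(set(t))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Second pass: count only pairs made of frequent items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    result = {(i,): item_counts[i] for i in frequent_items}
    result.update({p: c for p, c in pair_counts.items() if c >= min_support})
    return result


def top_k(itemsets, k):
    """Return the k itemsets with the highest support, ties broken by itemset."""
    return heapq.nlargest(k, itemsets.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical toy transactions (illustrative labels, not thesis data).
transactions = [
    ["rain", "level", "move"],
    ["rain", "move"],
    ["rain", "level"],
    ["level"],
]
itemsets = frequent_itemsets_up_to_2(transactions, min_support=2)
best = top_k(itemsets, k=2)
```

Restricting the second pass to items that are already frequent relies on the standard downward-closure property: a pair can only be frequent if both of its items are.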
Keywords/Search Tags:big data, association rules, Hadoop, PrePost, landslide