Research And Implementation On Efficient Parallel Frequent Itemsets Mining Algorithm Based On Spark

Posted on: 2019-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: F Zhang
GTID: 2428330563992488
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of real-world data volumes, data mining has attracted great attention in many research fields, especially in the game field. Frequent itemset mining is a widely used data mining technique and plays an important role in many data mining tasks. However, with the rapid development of Big Data, the demand for valuable information hidden in data keeps increasing. For example, one may want to discover the growth paths of users in large amounts of game data in order to give users a better gaming experience, but hardware conditions cannot keep up with the need for fast information mining. In other words, when neither the hardware nor the data volume can be changed, existing frequent itemset mining algorithms can no longer deliver useful information within an acceptable time. Studying and implementing an efficient parallel frequent itemset mining algorithm has therefore become an important direction in the field of data mining.

An efficient parallel frequent itemset mining algorithm named PNPFI is proposed. It is built on the PrePost algorithm and the Spark platform. PNPFI runs in parallel on Spark, with nodes working independently of one another. It proposes a novel N-list intersection algorithm that stops an intersection early by judging in advance whether the result can meet the threshold, which greatly reduces memory and time consumption. To further eliminate redundant N-list intersections, PNPFI introduces a new concept based on N-lists, P-Subsume. Through P-Subsume, PNPFI can combine it directly with items to generate some frequent itemsets without intersecting N-lists, which greatly reduces the runtime. In addition, for practical use, PNPFI proposes a load balancing strategy that partitions transactions by predicting item loads so that the cluster is load balanced.

The experimental results show that, compared with the classical parallel algorithm and a recently proposed parallel algorithm, PNPFI has a clear advantage in both performance and memory overhead: performance improves by up to 70% and by 39% on average, and memory consumption is reduced by up to 90% and by 71% on average.
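
The early-terminating N-list intersection mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact stopping rule, so the bound used here (support matched so far plus the count still unprocessed in the descendant list) is an assumption, and the names `PPCode` and `intersect_with_early_stop` are hypothetical. The ancestor test (smaller pre-order rank, larger post-order rank) follows the usual PrePost definition of N-lists.

```python
from typing import List, NamedTuple, Optional


class PPCode(NamedTuple):
    """One N-list entry: pre-order rank, post-order rank, node count."""
    pre: int
    post: int
    count: int


def intersect_with_early_stop(
    ancestors: List[PPCode],
    descendants: List[PPCode],
    min_support: int,
) -> Optional[List[PPCode]]:
    """Intersect two N-lists (both sorted by pre-order rank).

    Returns the N-list of the combined itemset, or None as soon as the
    itemset provably cannot reach min_support (assumed stopping rule).
    """
    result: List[PPCode] = []
    matched = 0                                      # support accumulated so far
    remaining = sum(d.count for d in descendants)    # count not yet examined
    i = j = 0
    while i < len(ancestors) and j < len(descendants):
        # Upper bound on the final support; stop early if it is too small.
        if matched + remaining < min_support:
            return None
        a, d = ancestors[i], descendants[j]
        if a.pre < d.pre:
            if a.post > d.post:
                # a is an ancestor of d: d's count contributes to the result.
                if result and result[-1].pre == a.pre:
                    last = result[-1]
                    result[-1] = last._replace(count=last.count + d.count)
                else:
                    result.append(PPCode(a.pre, a.post, d.count))
                matched += d.count
                remaining -= d.count
                j += 1
            else:
                i += 1  # a lies entirely before d; try the next ancestor
        else:
            remaining -= d.count  # d has no ancestor in the list; drop it
            j += 1
    return result if matched >= min_support else None


# Hand-built codes for a small prefix tree: a:5 -> (b:3 -> c:2, c:1), b:2 -> c:1
n_list_b = [PPCode(2, 1, 3), PPCode(5, 5, 2)]
n_list_c = [PPCode(3, 0, 2), PPCode(4, 2, 1), PPCode(6, 4, 1)]

bc = intersect_with_early_stop(n_list_b, n_list_c, min_support=3)
pruned = intersect_with_early_stop(n_list_b, n_list_c, min_support=4)
```

With a threshold of 3 the call returns the N-list of {b, c} with total count 3. With a threshold of 4 it returns `None` after the second `c` entry is dropped, without examining the last one, because the bound has already fallen to 3.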
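
The load balancing idea in the abstract (partitioning by predicted item loads) can be sketched in the same spirit. The abstract does not say how PNPFI predicts an item's load or assigns work, so this example swaps in a generic greedy heuristic: items are taken in descending order of a caller-supplied load estimate and each goes to the currently lightest partition. The function name `assign_items_to_partitions` and the load values are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_items_to_partitions(
    estimated_load: Dict[str, float],
    num_partitions: int,
) -> List[List[str]]:
    """Greedy assignment: heaviest item first, always to the lightest partition.

    estimated_load maps each item to a predicted mining cost; how that cost
    is predicted is left to the caller (an assumption of this sketch).
    """
    # Min-heap of (current total load, partition index).
    heap = [(0.0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(num_partitions)]
    # Sort by load descending, then by item name for a deterministic order.
    for item in sorted(estimated_load, key=lambda it: (-estimated_load[it], it)):
        load, p = heapq.heappop(heap)
        groups[p].append(item)
        heapq.heappush(heap, (load + estimated_load[item], p))
    return groups


# Made-up load estimates for five items, spread over two partitions.
loads = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
groups = assign_items_to_partitions(loads, num_partitions=2)
```

On a Spark cluster, a mapping like this could drive a custom partitioner so that the transactions needed for each item group are sent to the same node; that wiring is omitted here.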
Keywords/Search Tags:Data Mining, Frequent Itemsets Mining Algorithm, Parallel, Load Balance