Font Size: a A A

Research On Fast Mining Algorithm Of Frequent Itemsets And Its Distributed Implementation

Posted on:2021-11-14Degree:MasterType:Thesis
Country:ChinaCandidate:H B LiFull Text:PDF
GTID:2518306200453734Subject:Software engineering
Abstract/Summary:PDF Full Text Request
With the advent of the information sharing era,billions of data are produced every day in the world.The amount of useful data increases as the overall base increases,but it is still a minority.Faced with such a situation,it is urgent to discover it quickly.Data mining technologies such as classification,clustering and frequent pattern mining are widely used to extract useful or relevant information from large data sets.As one of the most frequently used methods,frequent itemset mining can mine interesting and intrinsic patterns from a large number of data,so as to help people make better decisions.In recent years,it has been widely used in crime prediction,network security,stock prediction,biological gene information analysis and other fields.Because most frequent itemset mining is iterative process,which usually leads to high time complexity,it is of great theoretical and practical significance to design a fast and efficient frequent pattern mining algorithm.The traditional mining algorithm based on candidate-test mainly faces the problems of too many candidate itemsets and frequent scanning of data,while the non candidate-test mining algorithm mainly uses FP-Tree to compress the database,and then recursively mines by constructing the conditional pattern tree of items.When the number of items is large,recursion mining is easy to cause the problem of memory overflow.With the increasing amount of data,the efficiency of the traditional standalone algorithm to produce frequent itemsets is getting lower and lower.Due to the characteristics of iterative computing and the emergence of many distributed computing frameworks,frequent pattern mining provides a new way to solve the problem of mining efficiency,and the research of frequent pattern distributed mining is the current research hotspot.By reading a large number of related literatures in the field of frequent itemset mining and distributed computing,a novel bitstring pattern growth strategy is designed,using bitstrings to express itemsets,and using bit operations to quickly generate candidate itemsets,and proposes Mafibs algorithm based on bitstring.Secondly,aiming at the problem of low computational efficiency of long bitstring,a bitstring grouping strategy and a bitstring join strategy are designed.Based on these two strategies,an improved Fmafibs algorithm is proposed based on Mafibs algorithm.The algorithm groups the long bitstrings,first generate the frequent bitstrings contained in each group,and then the frequent bitstrings generated by different groups of the same transaction are used to obtain the candidate bitstrings between the groups utilize the join strategy.The candidate bitstrings generated by the join strategy are aggregated and filtered to obtain the final frequent bitstring,and finally the frequent bitstring is parsed to obtain all frequent itemsets.Spark is very suitable for iterative operation because of its special caching mechanism and easy to handle big data.The algorithm is implemented on Spark computing platform,and the benchmark dataset in frequent itemset mining field is used for experimental validation.The final results indicate that the proposed algorithm can raise the mining efficiency while ensuring the accuracy of mining results.
Keywords/Search Tags:frequent itemset, bitstring, distributed compute, Spark
PDF Full Text Request
Related items