Research On Efficient Mining Algorithm For Rare Itemsets

Posted on: 2019-07-18  Degree: Master  Type: Thesis
Country: China  Candidate: S N Liu  Full Text: PDF
GTID: 2428330590465785  Subject: Computer technology
Abstract/Summary:
Association rule mining is one of the important methods of data mining. Its main value lies in finding potential and valuable correlations between different items in the data. According to how frequently they occur, items in the data can be divided into frequent items and rare items. The mining of frequent itemsets is currently the focus of attention. However, frequent itemset mining filters out rare items, while the study of rare association rules can uncover many previously unknown and valuable real-world patterns. Applying these patterns in some areas can bring great economic and social benefits. Therefore, how to quickly and effectively mine rare association rules from data, so as to give decision-makers a more scientific basis for planning, is an important topic in the field of data mining. With the arrival of the big data era, data volumes are growing rapidly, so mining rare itemsets quickly and effectively from large-scale data is a key issue. Based on the distributed computing framework Spark, this thesis parallelizes rare itemset mining according to the characteristics of rare itemset mining algorithms, so that the algorithm can handle large-scale data quickly and efficiently. The main research work of this thesis is as follows:

(1) First, the thresholds and filter conditions of the DEclat algorithm are reset so that the improved algorithm, DEclat', is suitable for mining rare itemsets. However, when DEclat' mines rare itemsets, a large number of intersection operations make the algorithm inefficient. To solve this problem, the REclat algorithm, based on the idea of a hash Boolean matrix, is proposed. The proposed algorithm reduces the time required for each intersection computation, that is, the time needed to count the support of candidate sets. Theoretical analysis and comparison experiments show that REclat has good execution efficiency when mining rare itemsets from data sets with different numbers of transactions and different numbers of attributes.

(2) To enable REclat to mine rare itemsets effectively in a big data environment, the SP-REclat algorithm, a parallelization of REclat on the Spark framework designed according to the characteristics of REclat, is proposed. First, itemsets with the same prefix are divided into equivalence classes, so that all itemsets in the same equivalence class are assigned to the same computing node. Then, the k-item equivalence classes on the same node can be joined directly to generate (k+1)-item rare itemsets. Finally, the (k+1)-item rare itemsets generated by each node are again divided into equivalence classes. SP-REclat is called iteratively to mine rare itemsets until no more itemsets are produced. In this way, the parallelization of the REclat algorithm under the Spark framework is realized. The experiments show that SP-REclat is feasible and effective, and that it has good speedup and scalability.
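
The two ideas in (1) and (2), Boolean-row support counting and prefix-based equivalence classes, can be illustrated with a minimal single-machine sketch. This is not the thesis's REclat or SP-REclat implementation: the abstract does not give the hash Boolean matrix layout or the exact filter conditions, so the sketch assumes one Boolean row per item (stored as a Python integer, one bit per transaction) and assumes a rare itemset is one whose support satisfies `min_sup <= support < max_sup`. The names `build_bool_rows`, `support`, `mine_rare_itemsets`, `min_sup`, and `max_sup` are hypothetical.

```python
from collections import defaultdict


def build_bool_rows(transactions):
    """Map each item to an int used as a Boolean row:
    bit t is 1 when transaction t contains the item."""
    rows = defaultdict(int)
    for t, items in enumerate(transactions):
        for item in items:
            rows[item] |= 1 << t
    return dict(rows)


def support(row):
    """Support of an itemset = number of 1 bits in its Boolean row."""
    return bin(row).count("1")


def mine_rare_itemsets(transactions, min_sup, max_sup):
    """Return {itemset: support} for min_sup <= support < max_sup.

    Assumption (not from the thesis): itemsets below min_sup are pruned,
    because no superset can have higher support; itemsets at or above
    max_sup are not reported but are still extended.
    """
    rows = build_bool_rows(transactions)
    # 1-itemsets that pass the lower bound, in a fixed item order.
    level = [((item,), row) for item, row in sorted(rows.items())
             if support(row) >= min_sup]
    rare = {}
    while level:
        for itemset, row in level:
            if support(row) < max_sup:
                rare[itemset] = support(row)
        # Group k-itemsets into equivalence classes by shared (k-1)-prefix.
        classes = defaultdict(list)
        for itemset, row in level:
            classes[itemset[:-1]].append((itemset, row))
        next_level = []
        for members in classes.values():
            # Joins happen only inside one class, so classes are independent.
            for i, (a, row_a) in enumerate(members):
                for b, row_b in members[i + 1:]:
                    joined = row_a & row_b  # bitwise AND replaces tid-list intersection
                    if support(joined) >= min_sup:
                        next_level.append((a + (b[-1],), joined))
        level = next_level
    return rare


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b", "d"}]
    for itemset, sup in mine_rare_itemsets(data, min_sup=1, max_sup=3).items():
        print(itemset, sup)
```

Because candidates are joined only within a prefix class, each class is a natural unit of work to place on one node, which is the property the Spark parallelization in (2) relies on; the distribution step itself is not shown here.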
Keywords/Search Tags: association rules, rare itemsets, Eclat algorithm, parallel computing