Since frequent itemset mining (FIM) was first proposed, its high time complexity has motivated many researchers to improve the efficiency of FIM algorithms. Traditional FIM algorithms handle big data poorly because they are limited by the computing capability and memory of a single computer. Based on a comparison of existing FIM algorithms and a study of the Spark framework, this paper proposes a new itemset representation called HybridNodeset, together with a new serial FIM algorithm based on it called HybridFIN. Experimental results demonstrate that HybridFIN performs better on different types of datasets. In addition, this paper applies the new itemset representation to the maximal frequent itemset mining problem and adopts a new projection strategy based on the MFI-Tree. This paper also proposes a parallel FIM algorithm based on Spark called PHybridFIN. PHybridFIN projects the original transactional dataset into multiple conditional datasets and adopts Transaction Trees to reduce the time spent on network transmission. Experimental results indicate that PHybridFIN outperforms PFP, which is implemented in Spark MLlib. Finally, this paper improves the parallelization strategy of PHybridFIN and proposes a parallel FIM algorithm called PHybridFIN+. Experimental results show that PHybridFIN+ achieves better performance.