
Research On High-utility Itemset Data Mining Based On Distributed Platform

Posted on: 2021-03-14  Degree: Master  Type: Thesis
Country: China  Candidate: W Shen  Full Text: PDF
GTID: 2428330611973245  Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the Internet and big data era, data from all walks of life has grown explosively, which poses major challenges to existing storage solutions and to data mining. How to handle this huge amount of data has become a difficult problem. Data mining technology can not only process existing data effectively but also extract valuable information from massive databases, providing sound guidance for production, operation and development. Frequent Itemset Mining (FIM) is a basic data mining method that is often used to find associations between items. FIM considers only how often items occur, not their value. Some scholars have therefore proposed high-utility itemset mining (HUIM), which takes both the value and the frequency of items into account and so offers more practical guidance than FIM. The purpose of high-utility itemset mining is to find all itemsets in the database whose utility exceeds a given threshold.

Some high-utility itemset mining algorithms suffer from low mining efficiency and high memory usage. To address this, an efficient high-utility itemset mining algorithm based on a reorganized transaction database is proposed. For large data sets that cannot be mined successfully on a single machine, an efficient high-utility itemset mining algorithm based on distributed parallelism is proposed. The main work of this thesis is as follows:

(1) An efficient high-utility itemset mining algorithm based on a reorganized transaction database (EIM-DS) is proposed to address the long running time and large memory footprint of traditional high-utility itemset mining algorithms. Firstly, a new data set structure is introduced to reorganize the data set and improve its utilization rate. Secondly, a repeated TWU (Transaction Weighted Utilization) pruning strategy is proposed to shorten the itemsets. Then, a construction tree is proposed to reduce the search space, and compressed storage is used to reduce the space the construction tree occupies. Finally, two new pruning strategies are proposed for the search process, extension utility and local TWU utility, together with a fast method for computing these two upper bounds, which further reduces the search space and improves the efficiency of the algorithm. Compared with existing high-utility itemset mining algorithms, the proposed EIM-DS algorithm achieves better performance in terms of execution time and memory.

(2) In the EIM-DS algorithm, the reorganized data set and the compressed stored data are read-only and are never written to, so they can be used by multiple threads at the same time. This thesis therefore proposes a multi-threaded EIM-DS algorithm (T-EIM-DS) to further improve efficiency. Compared with the single-threaded version, the execution time of T-EIM-DS decreases as the number of threads increases, while memory usage grows by a smaller factor than the number of threads.

(3) Relying on the easy deployment, low overhead and high scalability of the Hadoop platform, a distributed parallel framework is proposed. Using the EIM-DS algorithm and the EFIM algorithm as the parallel algorithms, two distributed high-utility itemset mining algorithms are proposed: the P-EFIM (Parallel EFficient high-utility Itemset Mining) algorithm and the P-EIM-DS algorithm. First, the TWU value of each item is calculated and the items are ordered by TWU. Then, the data set is renumbered according to the ordered item sequence, and items whose TWU is below the threshold are removed to improve data set utilization. Finally, the Map phase decomposes the entire task into multiple independent subtasks. To balance the load across nodes, an S-type distribution strategy is proposed that distributes the subtasks evenly to each node. In the Reduce phase, P-EFIM and P-EIM-DS use the EFIM algorithm and the EIM-DS algorithm, respectively, to mine high-utility itemsets within each subtask. Compared with the PHUI-Growth algorithm, which also uses the MapReduce framework, the P-EFIM and P-EIM-DS algorithms perform better in terms of execution time.

In summary, this thesis proposes an improved HUIM algorithm, uses multithreading to further reduce its execution time, and then introduces distributed computing to handle large-scale data sets that are otherwise difficult to mine. These contributions can broaden the research scope of the HUIM field.
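The preprocessing described in point (3) can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes the standard definition of TWU (the sum of the utilities of all transactions that contain an item), an ascending TWU order, and a toy database. The abstract does not define the S-type distribution strategy; `snake_assign` is a hypothetical serpentine round-robin that only shows one plausible way to spread subtasks evenly. All function and variable names are illustrative.

```python
from collections import defaultdict

# Toy database (hypothetical): each transaction maps item -> utility.
transactions = [
    {"a": 5, "b": 2, "c": 1},   # transaction utility = 8
    {"b": 4, "c": 3},           # transaction utility = 7
    {"a": 10, "d": 6},          # transaction utility = 16
    {"c": 2, "e": 1},           # transaction utility = 3
]

def compute_twu(db):
    """TWU of an item = sum of utilities of the transactions containing it."""
    twu = defaultdict(int)
    for t in db:
        tu = sum(t.values())
        for item in t:
            twu[item] += tu
    return dict(twu)

def prune_and_renumber(db, min_util):
    """Drop items with TWU < min_util, then renumber the rest by TWU order.

    Ascending order is an assumption; the abstract only says items are
    ordered according to TWU.
    """
    twu = compute_twu(db)
    kept = sorted((i for i in twu if twu[i] >= min_util),
                  key=lambda i: (twu[i], i))
    new_id = {item: n for n, item in enumerate(kept)}
    reorganized = []
    for t in db:
        row = {new_id[i]: u for i, u in t.items() if i in new_id}
        if row:  # skip transactions left empty after pruning
            reorganized.append(row)
    return new_id, reorganized

def snake_assign(task_ids, n_nodes):
    """Hypothetical serpentine assignment: 0,1,2,2,1,0,0,1,2,... over nodes."""
    buckets = [[] for _ in range(n_nodes)]
    for pos, task in enumerate(task_ids):
        rnd, off = divmod(pos, n_nodes)
        node = off if rnd % 2 == 0 else n_nodes - 1 - off
        buckets[node].append(task)
    return buckets

if __name__ == "__main__":
    ids, db = prune_and_renumber(transactions, min_util=10)
    print(ids)
    print(db)
    print(snake_assign(list(range(6)), 3))
```

If subtasks are ordered by expected cost, a serpentine pass pairs cheap and expensive subtasks on the same node, which is one common motivation for this kind of assignment. Whether the thesis uses exactly this scheme is not stated in the abstract.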
Keywords/Search Tags:data mining, high-utility itemset mining, pattern mining, big data, distributed computing