Research On Parallelization And Load Balancing Of Frequent Pattern Mining Algorithm Based On MapReduce

Posted on:2020-08-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y Yan

Full Text:PDF

GTID:2428330578455270

Subject:Computer Science and Technology

Abstract/Summary:

This thesis addresses big data processing tasks. Building on Hadoop, parallelization, and load balancing, it studies how to optimize a frequent pattern mining algorithm and its load balancing performance under parallel computing, with the aim of achieving parallel data processing, a balanced cluster load, and a reasonable data distribution mechanism in a large-scale cluster environment.

To reduce the time, space, and I/O costs of the mining process, the thesis adopts the FIUT algorithm and runs it on the Hadoop platform. Mining with the streamlined FIU-Tree effectively reduces the search space and the amount of recursion, while a Hadoop cluster with the highly parallel MapReduce framework meets the computing demands of big data. The thesis therefore implements FIUT in parallel on MapReduce. Because the sequential execution order of FIUT hinders independent parallel mining, the decomposition process is optimized: the whole algorithm is divided into three MapReduce jobs, so that each computing node independently constructs a local subtree to complete its share of the parallel mining task.

In a distributed cluster, load balancing directly affects the efficiency of parallel algorithms, so balancing the computational load across nodes in the Hadoop environment is the second focus of the thesis. To address the shortcomings of the even group-partitioning mechanism in the existing PFP algorithm, the thesis adopts a new load estimation method and redesigns the group-partitioning strategy to balance computation globally. In addition, to improve the load balancing of the parallel FIUT algorithm, it optimizes the data distribution strategy by taking into account how the cost of decomposing itemsets affects a node's computational load, and on this basis proposes a load balancing algorithm for parallel FIUT in a Hadoop cluster. The algorithm uses reducing the disparity in the numbers of long and short itemsets across Reduce tasks as its grouping criterion, and it quantifies a load weight parameter that estimates the computing load a node incurs when processing a task, which provides the basis for distributing data among groups. To make the data skew of the cluster directly visible, parallel entropy is studied and defined as the load balancing factor, and the relationship between parallel entropy and the overall cluster load is derived from its underlying theory. Compared with the existing MapReduce-based PFP algorithm, experimental results on the webdocs.dat dataset show that the proposed optimization scheme effectively improves the parallel mining performance of the algorithm and meets expectations.
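
The abstract gives neither the formula for parallel entropy nor the definition of the load weight, so the following Python sketch is only an illustration under stated assumptions, not the thesis's method. It assumes parallel entropy is the Shannon entropy of each node's share of the total load, normalized to 1.0 for a perfectly balanced cluster, and it substitutes a standard greedy heuristic (heaviest item to the currently lightest group) for the proposed grouping strategy. The function names and the example weights are hypothetical.

```python
import heapq
import math


def parallel_entropy(loads):
    """Normalized Shannon entropy of per-node load shares.

    Assumed definition: 1.0 means perfectly balanced, values near 0 mean
    almost all load sits on one node.
    """
    total = sum(loads)
    if total <= 0 or len(loads) < 2:
        return 1.0
    shares = [x / total for x in loads if x > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(loads))


def equal_count_groups(weights, n_groups):
    """Baseline: cut the item list into groups of (nearly) equal size."""
    items = list(weights)
    size = math.ceil(len(items) / n_groups)
    return [items[i:i + size] for i in range(0, len(items), size)]


def greedy_groups(weights, n_groups):
    """Stand-in heuristic: heaviest item goes to the lightest group so far."""
    heap = [(0.0, g) for g in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    for item, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        load, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (load + w, g))
    return groups


def group_loads(groups, weights):
    """Total estimated load of each group."""
    return [sum(weights[i] for i in grp) for grp in groups]


if __name__ == "__main__":
    # Hypothetical estimated load weights per item (not from the thesis).
    weights = {"a": 8, "b": 7, "c": 2, "d": 1, "e": 1, "f": 1}
    for name, fn in [("equal-count", equal_count_groups),
                     ("greedy", greedy_groups)]:
        loads = group_loads(fn(weights, 2), weights)
        print(name, loads, round(parallel_entropy(loads), 3))
```

Under these assumptions, a grouping that looks only at item counts can leave one group with most of the work, whereas a grouping driven by estimated load weights pushes the entropy toward 1.0. This is the sense in which an entropy-style factor can serve as a single-number indicator of data skew.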

Keywords/Search Tags:

frequent pattern mining, MapReduce, parallel computing, load balancing, parallel entropy

Related items

1	Parallel Frequent Itemset Mining Based On MapReduce
2	Research On Parallel Frequent Graph Pattern Mining
3	Study And Implementation On Techniques Of Parallel Mining Of Frequent Closed Sequences Based On Vertical Segmentation
4	The Design And Implementation Of Parallel Computing Platform Based On MapReduce
5	Research Of Parallel Frequent Itemset Mining Algorithm Based On MapReduce
6	Research On Spark-based Parallel Contrast Pattern Mining Algorithm And Load Balancing
7	Research On Frequent Pattern Mining In A Single Large Graph
8	Load Balancing Problems For Parallel And Distributed Computing
9	MapReduce-based Parallel Data Mining Services For TCM
10	Mpich-based Parallel Computing System, Load Balancing Technology