
Research On Spark-based Parallel Contrast Pattern Mining Algorithm And Load Balancing

Posted on: 2022-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: L Fang
Full Text: PDF
GTID: 2518306731477854
Subject: Computer technology
Abstract/Summary:
Contrast patterns describe the differences between data sets that carry class labels. They capture the distinguishing characteristics of each class of data and are often used to build highly accurate classifiers. However, contrast pattern mining is an NP-hard problem, and the number of candidate itemsets it produces approaches 2^k. Traditional mining algorithms generally run serially on a single machine, which is limited in CPU and memory. These methods therefore become a bottleneck when mining contrast patterns, especially on large-scale, high-dimensional data sets, where they are prone to memory overflow, ineffective mining and other problems. To address these problems, this paper studies a Spark-based parallel contrast pattern mining algorithm and integrates a new load balancing strategy so that large-scale, high-dimensional data sets can be mined effectively. The research work of this paper is as follows.

First, this paper presents a parallel contrast pattern mining algorithm based on Spark. The algorithm first constructs an EDCP-Tree, then generates a 1-item KCP?Info array structure from the EDCP-Tree. On this basis, it constructs m-item candidate pattern structures that are independent of one another. Finally, it mines the contrast patterns in parallel according to the m-item suffix candidate patterns. The algorithm thus divides the search space of contrast patterns into independent units that can be mined in parallel, providing a scalable solution for large-scale, high-dimensional data sets. Two data sets of different sizes and dimensions are used to test the performance of the algorithm on a Spark cluster. Experimental results show that the proposed algorithm achieves high parallelism and scalability.

Second, in a distributed Spark cluster, the execution efficiency of a parallel algorithm is determined by the node with the longest running time, so the mining efficiency of the parallel algorithm depends on the amount of computation assigned to each node. Balancing the amount of computation across the nodes of the Spark cluster is therefore a focus of this paper. Spark's default Hash Partitioner and Range Partitioner strategies do not consider the computational weight of each node when dividing a data set; they partition directly by key or by range, which easily causes data skew and related problems. To solve this problem, this paper proposes BS-SPCP, a load balancing algorithm for contrast pattern mining. The algorithm considers both the itemset generation cost and the number of comparisons between itemsets in order to estimate the load weight of each node and balance the computing load among the nodes of the Spark cluster. That is, by estimating the computation generated by the m-item candidate patterns, it makes the total amount of computation on each node tend toward the same value, which balances the cluster load and improves parallel efficiency. Comparing running times before and after load balancing shows that the proposed BS-SPCP algorithm makes the total amount of computation on the nodes more consistent and thereby improves parallel efficiency.
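The idea of weight-aware partitioning described above can be illustrated with a minimal sketch. This is not the BS-SPCP algorithm itself, whose details the abstract does not give. It assumes a hypothetical load estimate (the plain sum of a generation cost and a comparison count) and swaps in a standard greedy heuristic: assign the heaviest remaining candidate pattern to the currently least-loaded partition. The names `estimate_weight`, `balance` and `partition_loads`, and all of the numbers, are illustrative assumptions.

```python
import heapq


def estimate_weight(generation_cost, comparison_count):
    # Assumption: the abstract names these two factors but not how they are
    # combined, so a plain sum is used here for illustration only.
    return generation_cost + comparison_count


def balance(weights, num_partitions):
    """Greedily map each key to a partition, heaviest key first."""
    # Min-heap of (current_load, partition_id): the least-loaded partition
    # is always at the top.
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending weight; ties are broken by key for determinism.
    for key, w in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + w, p))
    return assignment


def partition_loads(weights, assignment, num_partitions):
    """Sum the estimated weight that lands on each partition."""
    loads = [0] * num_partitions
    for key, p in assignment.items():
        loads[p] += weights[key]
    return loads


# Hypothetical m-item candidate patterns: (generation cost, comparison count).
candidates = {
    "a": (5, 4), "b": (4, 3), "c": (2, 4),
    "d": (3, 2), "e": (1, 3), "f": (2, 1),
}
weights = {k: estimate_weight(g, c) for k, (g, c) in candidates.items()}
assignment = balance(weights, num_partitions=2)
loads = partition_loads(weights, assignment, num_partitions=2)
print(assignment)
print(loads)
```

A mapping like `assignment` could then drive a custom partitioning function in place of hash or range partitioning, for example through PySpark's `RDD.partitionBy(numPartitions, partitionFunc)`; whether the thesis does so in this way is not stated in the abstract.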
Keywords/Search Tags: Spark, Contrast Pattern Mining, Load Balancing, Parallel Computing