Research On Spark-based Parallel Contrast Pattern Mining Algorithm And Load Balancing

Posted on:2022-06-04

Degree:Master

Type:Thesis

Country:China

Candidate:L Fang

GTID:2518306731477854

Subject:Computer technology

Abstract/Summary:

Contrast patterns clearly describe the differences between data sets that contain class labels. They capture the distinguishing characteristics of the various classes of data and are often used to build highly accurate classifiers. However, contrast pattern mining is an NP-hard problem, and the number of candidate itemsets it produces approaches 2^k. Traditional mining algorithms generally run serially on a single machine, which is limited in both CPU and memory. The traditional methods therefore hit bottlenecks when mining contrast patterns, especially on large-scale, high-dimensional data sets, where they are prone to memory overflow, ineffective mining and other problems. To address these problems, this thesis studies a Spark-based parallel contrast pattern mining algorithm and integrates a new load balancing strategy so that large-scale, high-dimensional data sets can be mined effectively. The research work is as follows.

First, this thesis presents a parallel contrast pattern mining algorithm based on Spark. The algorithm first constructs an EDCP-Tree and then generates a 1-item KCP?Info array structure from the EDCP-Tree. On this basis, it constructs m-item candidate pattern structures that are mutually independent. Finally, it mines the contrast patterns in parallel according to the m-item suffix candidate patterns. The algorithm divides the search space of contrast patterns into independent units that can be mined in parallel, providing a scalable solution for large-scale, high-dimensional data sets. Two data sets of different sizes and dimensions are used to test the performance of the algorithm on a Spark cluster. Experimental results show that the proposed algorithm achieves high parallelism and scalability.

In addition, in a distributed Spark cluster environment, the execution efficiency of a parallel algorithm is determined by the node with the longest running time, so the mining efficiency of the parallel algorithm depends on the amount of computation on each node in the cluster. Balancing the amount of computation among the nodes of the Spark cluster is therefore the second focus of this thesis. When dividing a data set, Spark's default Hash Partitioner and Range Partitioner strategies do not consider the computational weight placed on each node; they partition directly by key or by range, which easily causes data skew and related problems. To solve this problem, this thesis proposes BS-SPCP, a load balancing algorithm for contrast pattern mining. The algorithm jointly considers the itemset generation cost and the number of comparisons between itemsets in order to estimate the load weight of each node and balance the computing load among the nodes of the Spark cluster. That is, by estimating the computation generated by each m-item candidate pattern, it makes the total amount of computation on each node tend to be the same, which balances the load across the cluster and improves parallel efficiency. Comparing the running time of the BS-SPCP algorithm before and after load balancing shows that the proposed load balancing algorithm makes the total amount of computation tend to be consistent across nodes and thereby improves parallel efficiency.
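The load balancing idea in the abstract can be illustrated with a small sketch. This is not the BS-SPCP algorithm itself, whose weight formula the abstract does not give. It assumes each independent candidate-pattern unit already has an estimated weight (the numbers below are made up), and it swaps in a standard greedy heuristic (heaviest unit first, placed on the currently lightest node) to show why weight-aware assignment can shorten the longest-running node compared with an assignment that ignores weights. The function names `index_assign` and `greedy_assign` are hypothetical, and plain Python is used instead of the Spark API.

```python
import heapq

def index_assign(weights, num_nodes):
    """Weight-blind baseline: unit i goes to node i % num_nodes.
    A simple stand-in for key-based partitioning, not Spark's actual partitioner."""
    loads = [0] * num_nodes
    for i, w in enumerate(weights):
        loads[i % num_nodes] += w
    return loads

def greedy_assign(weights, num_nodes):
    """Weight-aware heuristic: take units from heaviest to lightest and
    give each one to the node with the smallest load so far."""
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for unit, w in sorted(enumerate(weights), key=lambda p: -p[1]):
        load, node = heapq.heappop(heap)
        assignment[unit] = node
        heapq.heappush(heap, (load + w, node))
    loads = [0] * num_nodes
    for unit, node in assignment.items():
        loads[node] += weights[unit]
    return assignment, loads

# Hypothetical estimated weights for eight candidate-pattern units.
weights = [9, 1, 8, 1, 7, 1, 6, 1]
baseline_loads = index_assign(weights, 2)
assignment, balanced_loads = greedy_assign(weights, 2)

# The slowest node bounds the parallel running time.
print("weight-blind loads:", baseline_loads, "max:", max(baseline_loads))
print("weight-aware loads:", balanced_loads, "max:", max(balanced_loads))
```

In a real Spark job the resulting unit-to-node mapping would typically be applied through a custom partitioner; that wiring is omitted here to keep the sketch self-contained.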

Keywords/Search Tags:

Spark, Contrast Pattern Mining, Load Balancing, Parallel Computing

Related items

1	Research On Parallelization And Load Balancing Of Frequent Pattern Mining Algorithm Based On MapReduce
2	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
3	Load Balancing Problems For Parallel And Distributed Computing
4	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
5	Mpich-based Parallel Computing System, Load Balancing Technology
6	Research On Parallel Frequent Graph Pattern Mining
7	Parallel Research Of GSP Algorithm Based On Spark
8	Research And Implementation Of Dynamic Load-balancing Method Under Parallel Computing
9	Multi-threshold Based Contrast Pattern Mining And Its Application In Classification Of Imbalanced Datasets
10	Design And Implementation Of A Load Balancing System Based On PVM