Intelligent Scheduling Method To Spark Workflow With Distributed Privacy Data

Posted on: 2021-05-26

Degree: Master

Type: Thesis

Country: China

Candidate: J Fu

Full Text: PDF

GTID: 2518306557492524

Subject: Computer technology

Abstract/Summary:

Big data technology has been widely applied in domains such as medical care, insurance, government, and scientific research, where part of the data is highly sensitive. With frequent reports of private data leaks and the introduction of laws and regulations on data privacy, privacy protection must be taken into account when Spark clusters are used to process massive amounts of data efficiently. This thesis considers the problem of scheduling Spark tasks when privacy data are distributed across multiple data centers. The objective is to minimize the makespan of a Spark application. The main challenges are threefold: (i) how to ensure that the scheduling process meets the privacy constraints of Spark applications; (ii) how to prevent the small number of resources that meet the privacy constraints from impeding scheduling; (iii) how to obtain an optimal task sequence in a DAG composed of stages and tasks that can be executed independently in parallel, so as to minimize the makespan.

According to the characteristics of Spark scheduling problems in heterogeneous clusters, a corresponding mathematical model is established. Based on the existing Spark framework, a system architecture that considers application privacy constraints is designed, and an intelligent algorithm framework for Spark task scheduling is proposed. The framework includes five algorithm components: task serialization, chromosome coding, population initialization, fitness function, and genetic operation. In the task serialization part, task ordering rules based on stage priority, privacy priority, and data volume priority are proposed. In the chromosome coding part, a binary chromosome coding is designed to reflect the task sequence and the scheduling scheme, and a reference matrix is designed to ensure that individuals meet the application privacy constraints. In the fitness function part, a fitness function is presented to guide individual selection. The genetic operation covers the following five aspects: a makespan minimization strategy is proposed to initialize one individual in order to guide the overall search and evolution of the algorithm; the other valid initial individuals are generated randomly based on the reference matrix; a selection strategy that retains the parent individual with the highest fitness is put forward; a crossover mechanism that adopts different crossover probabilities and multiple crossover methods according to individual fitness is established, so that new individuals evolve towards higher fitness after crossover; and different mutation probabilities are used according to individual fitness, with valid individuals generated based on the reference matrix.

To evaluate the performance of the proposed algorithm, multi-factor analysis of variance (ANOVA) is adopted to calibrate the parameters, and the best parameter combination is then determined by means of plots. Two related algorithms are used as baselines. The performance of the proposed algorithm is compared with that of the baseline algorithms from different aspects. Experimental results show that the proposed algorithm outperforms the compared algorithms across different numbers of jobs and data centers.
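The components named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the model, so the `Task` fields, the integer (rather than binary) gene encoding, the direction of each ordering rule, the one-slot-per-data-center timing model, and the linear chain of stages are all assumptions made here for illustration. The sketch shows how a reference matrix can keep randomly generated and mutated individuals within the privacy constraints, and how a makespan could serve as the basis of a fitness value.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    stage: int      # index of the stage the task belongs to (assumed linear chain)
    private: bool   # True if the task reads privacy-sensitive data
    volume: float   # input data volume, arbitrary units


def order_tasks(tasks):
    """Assumed ordering: earlier stage, then private tasks, then larger volume."""
    return sorted(tasks, key=lambda t: (t.stage, not t.private, -t.volume))


def reference_matrix(tasks, trusted):
    """allowed[i][d] is True if task i may run in data center d.

    Assumption: private tasks may only run in trusted data centers.
    """
    return [[(not t.private) or trusted[d] for d in range(len(trusted))]
            for t in tasks]


def is_feasible(individual, allowed):
    """An individual assigns one data center index to each ordered task."""
    return all(allowed[i][d] for i, d in enumerate(individual))


def random_individual(allowed, rng):
    """Sample each gene only from the data centers the matrix permits.

    Assumes every task has at least one permitted data center.
    """
    return [rng.choice([d for d, ok in enumerate(row) if ok]) for row in allowed]


def mutate(individual, allowed, rate, rng):
    """Resample a gene with probability `rate`, staying inside the matrix."""
    return [
        rng.choice([d for d, ok in enumerate(allowed[i]) if ok])
        if rng.random() < rate else gene
        for i, gene in enumerate(individual)
    ]


def makespan(tasks, individual, speed):
    """Finish time of the last task; `tasks` must already be in stage order.

    Assumptions: one task at a time per data center, run time equals
    volume / speed, and a stage starts only after all earlier stages finish.
    """
    dc_free = [0.0] * len(speed)
    stage_ready = 0.0   # earliest start time for the current stage
    latest_finish = 0.0
    current_stage = None
    for task, dc in zip(tasks, individual):
        if task.stage != current_stage:
            stage_ready = latest_finish
            current_stage = task.stage
        start = max(dc_free[dc], stage_ready)
        finish = start + task.volume / speed[dc]
        dc_free[dc] = finish
        latest_finish = max(latest_finish, finish)
    return latest_finish
```

A lower makespan corresponds to a fitter individual, so a fitness value could be taken as, for example, the reciprocal of `makespan(...)`; the fitness-dependent crossover and mutation probabilities described in the abstract are not reproduced in this sketch.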

Keywords/Search Tags:

Spark, Scheduling optimization, Privacy data, Genetic algorithm

Related items

1	Scheduling Big Data Tasks With Data Security And Privacy Constraints
2	Real-time Mass Data Processing Analysis And Optimization Based On Spark
3	Research On Job Scheduling And Memory Cache Optimization Based On SPARK
4	Improvement Of Genetic Algorithm And Using In Application And Research Of Scheduling Optimization
5	Spark Task Scheduling With Data Skew And Deadline Constraints
6	Research On Cloud Workflow Scheduling Method With Privacy Protection
7	Research On Mine Resource Optimal Scheduling Model Based On Genetic Algorithm
8	Research On Memory Optimization Algorithm Based On Weight Priority Task Scheduling Strategy In Spark Platform
9	Workflow Scheduling With Privacy Protection In Hybrid Cloud Environment
10	Research On Optimization Mechanism Of Containerized Spark Resource Scheduling In Cloud Environment