
Intelligent Scheduling Method To Spark Workflow With Distributed Privacy Data

Posted on: 2021-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Fu
Full Text: PDF
GTID: 2518306557492524
Subject: Computer technology
Abstract/Summary:
Big data technology has been widely applied in domains such as medical care, insurance, government, and scientific research, where part of the data is highly sensitive. With frequent reports of private data leakage and the introduction of laws and regulations protecting data privacy, privacy protection must be taken into account when Spark clusters are used to process massive amounts of data efficiently. This thesis considers the problem of scheduling Spark tasks when private data are distributed across multiple data centers. The objective is to minimize the makespan of a Spark application. The main challenges include three aspects: (i) how to ensure that the scheduling process meets the privacy constraints of Spark applications; (ii) how to prevent the small number of resources that meet the privacy constraints from impeding scheduling; (iii) how to obtain an optimal task sequence in a DAG composed of stages and tasks that can be executed independently in parallel, so as to minimize the makespan.

According to the characteristics of Spark scheduling problems in heterogeneous clusters, a corresponding mathematical model is established. Based on the existing Spark framework, a system architecture that considers application privacy constraints is designed, and an intelligent algorithm framework for Spark task scheduling is proposed. The framework includes five algorithm components: task serialization, chromosome coding, population initialization, fitness function, and genetic operation. For task serialization, task ordering rules based on stage priority, privacy priority, and data volume priority are proposed. For chromosome coding, a binary chromosome encoding is designed to represent the task sequence and the scheduling scheme, and a reference matrix is designed to ensure that individuals meet the application privacy constraints. For the fitness function, a fitness measure is defined to guide individual selection. The genetic operations comprise the following five aspects: a makespan minimization strategy is proposed to initialize one individual in order to guide the overall search and evolution of the algorithm; the other valid initial individuals are generated randomly based on the reference matrix; a selection strategy that retains the parent individual with the highest fitness is put forward; a crossover mechanism that adopts different crossover probabilities and multiple crossover methods according to individual fitness is established, so that new individuals evolve towards higher fitness after crossover; and different mutation probabilities are used according to individual fitness, with valid individuals generated based on the reference matrix.

To evaluate the performance of the proposed algorithm, multi-factor analysis of variance (ANOVA) is adopted to calibrate the parameters, and the best parameter combination is then determined by means of plots. Two related algorithms are used as baselines. The performance of the proposed algorithm and the baseline algorithms is compared and analyzed from different aspects. Experimental results show that the proposed algorithm outperforms the compared algorithms for different numbers of jobs and data centers.
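The role of the reference matrix described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the integer privacy levels and data center clearances, the integer (rather than binary) encoding, the function names, and the makespan model that ignores DAG precedence and simply sums task durations per data center are all assumptions made for illustration.

```python
import random


def build_reference_matrix(task_privacy, dc_clearance):
    """allowed[i][j] is True when task i may run in data center j.

    Assumption: a task with privacy level p may only run in a data
    center whose clearance is at least p.
    """
    return [[clearance >= level for clearance in dc_clearance]
            for level in task_privacy]


def random_individual(allowed, rng):
    """Assign each task to a random data center permitted by the matrix."""
    individual = []
    for row in allowed:
        options = [j for j, ok in enumerate(row) if ok]
        if not options:
            raise ValueError("a task has no data center meeting its privacy level")
        individual.append(rng.choice(options))
    return individual


def is_feasible(individual, allowed):
    """Check that every assignment respects the reference matrix."""
    return all(allowed[task][dc] for task, dc in enumerate(individual))


def makespan(individual, durations, num_dcs):
    """Simplified makespan: tasks in one data center run back to back."""
    load = [0.0] * num_dcs
    for task, dc in enumerate(individual):
        load[dc] += durations[task][dc]
    return max(load)


# Hypothetical instance: three tasks, three data centers.
task_privacy = [0, 2, 1]
dc_clearance = [0, 1, 2]
durations = [[3, 3, 3], [4, 4, 4], [5, 5, 5]]

allowed = build_reference_matrix(task_privacy, dc_clearance)
candidate = random_individual(allowed, random.Random(0))
fitness = 1.0 / makespan(candidate, durations, len(dc_clearance))
```

Generating individuals only from the permitted entries of the matrix means every candidate is feasible by construction, so no repair step is needed after initialization. A mutation operator could reuse the same lookup to stay within the privacy constraints.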
Keywords/Search Tags: Spark, Scheduling optimization, Privacy data, Genetic algorithm