
Design And Implementation Of Data Flow Monitoring And Job Scheduling Subsystem Based On Spark

Posted on: 2021-10-25    Degree: Master    Type: Thesis
Country: China    Candidate: T Peng    Full Text: PDF
GTID: 2518306308467204    Subject: Computer technology
Abstract/Summary:
As a fast, general-purpose computing engine designed for large-scale data processing, Apache Spark offers high speed, broad applicability, and a rich set of built-in tools, and it is the current mainstream big data framework. Spark can cover the full data development pipeline: data collection, data cleaning, data analysis, and data presentation. For ultra-large-scale, commercial-grade big data scenarios, however, the Spark ecosystem still has room for improvement. First, Spark lacks comprehensive scheduling for business jobs. Job scheduling must weigh many factors, such as inter-job dependencies, job waiting time, and job urgency. Current research on big data job scheduling focuses mainly on how to split a big data job more reasonably or more quickly and then schedule the resulting sub-jobs; the Spark ecosystem still lacks a scheduling algorithm for business data processing jobs. Second, the Spark ecosystem lacks data operation and maintenance (O&M) monitoring capabilities, with little support for data quality assurance such as data monitoring and data governance.

Drawing on the logical structure and implementation of Spark's DAG (Directed Acyclic Graph), this thesis comprehensively considers cluster resource usage, job timing, and user expectations, and designs and implements a resource-oriented dynamic-priority job scheduling algorithm over multiple DAGs. The algorithm also supports emergency job scheduling, resolving problems observed in production: Spark jobs blocking and hanging, jobs preempting resources arbitrarily, and emergency jobs failing to execute promptly. To address end-to-end data quality monitoring, the thesis designs a data flow monitoring mechanism for the big data platform, improving the Spark platform's data O&M monitoring capability. The thesis implements both systems, Spark job scheduling and data O&M monitoring, and applies them in a national-level big data analysis environment with good results.
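The abstract does not reproduce the algorithm itself. As a rough sketch of how a dynamic priority might combine job waiting time, user-declared urgency, and fit against currently free cluster resources across multiple DAG jobs, consider the following Scala fragment. All names (SparkJob, PriorityScheduler) and the weights are illustrative assumptions, not the thesis's actual design.

    // Minimal sketch of dynamic-priority scheduling over multiple DAG jobs.
    // Names and weights are hypothetical, for illustration only.
    import scala.collection.mutable

    // A pending business job: a whole Spark DAG treated as one schedulable unit.
    case class SparkJob(
        id: String,
        submittedAtMs: Long,   // submission timestamp
        urgency: Int,          // user-declared urgency, e.g. 0 = normal, 10 = emergency
        requestedCores: Int,   // resources the job asks the cluster for
        dependsOn: Set[String] // ids of jobs whose output this job consumes
    )

    class PriorityScheduler(totalCores: Int,
                            wWait: Double = 0.4,
                            wUrgency: Double = 0.4,
                            wFit: Double = 0.2) {

      private val pending  = mutable.Map.empty[String, SparkJob]
      private val finished = mutable.Set.empty[String]
      private var freeCores = totalCores

      def submit(job: SparkJob): Unit = pending(job.id) = job

      def complete(job: SparkJob): Unit = {
        finished += job.id
        freeCores += job.requestedCores
      }

      // Dynamic priority: grows with waiting time, scales with urgency,
      // and prefers jobs that fit the currently free resources.
      private def priority(job: SparkJob, nowMs: Long): Double = {
        val waitMin = (nowMs - job.submittedAtMs) / 60000.0
        val fit = if (job.requestedCores <= freeCores) 1.0 else 0.0
        wWait * waitMin + wUrgency * job.urgency + wFit * fit
      }

      // Pick the next runnable job: dependencies satisfied, resources available,
      // highest dynamic priority. High urgency dominates the score, so emergency
      // jobs are not blocked behind long queues of ordinary jobs.
      def next(nowMs: Long): Option[SparkJob] = {
        val runnable = pending.values.filter { j =>
          j.dependsOn.subsetOf(finished) && j.requestedCores <= freeCores
        }
        val pick =
          if (runnable.isEmpty) None
          else Some(runnable.maxBy(priority(_, nowMs)))
        pick.foreach { j =>
          pending.remove(j.id)
          freeCores -= j.requestedCores
        }
        pick
      }
    }

Because the waiting-time term keeps growing, a low-urgency job cannot starve indefinitely; because urgency is weighted equally, a declared emergency job overtakes the queue as soon as its resources are available.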
Keywords/Search Tags: Spark, big data job scheduling, data quality monitoring, dynamic priority