
Design And Implementation Of Data Flow Monitoring And Job Scheduling Subsystem Based On Spark

Posted on: 2021-10-25    Degree: Master    Type: Thesis
Country: China    Candidate: T Peng    Full Text: PDF
GTID: 2518306308467204    Subject: Computer technology
Abstract/Summary:
As a fast, general-purpose computing engine designed for large-scale data processing, Apache Spark offers high speed, broad applicability, and a rich set of built-in tools, and it is the current mainstream big data framework. Spark can cover the full data development pipeline: data collection, data cleaning, data analysis, and data presentation. For ultra-large-scale, commercial-grade big data scenarios, however, the Spark ecosystem still has room for improvement. First, Spark lacks comprehensive scheduling for business jobs. Job scheduling must weigh many factors, such as inter-job dependencies, job waiting time, and job urgency. Current research on big data job scheduling focuses mainly on how to split a big data job more reasonably or more quickly and then schedule the resulting sub-jobs; the Spark ecosystem still lacks a scheduling algorithm for business data processing jobs. Second, the Spark ecosystem lacks data operation and maintenance (O&M) monitoring capabilities, with little support for data quality assurance such as data monitoring and data governance.

Drawing on the logical structure and implementation of Spark's DAG (Directed Acyclic Graph), this thesis comprehensively considers cluster resource usage, job timing, and user expectations, and designs and implements a resource-oriented dynamic-priority job scheduling algorithm over multiple DAGs. The algorithm also supports emergency job scheduling, resolving problems observed in production: Spark jobs blocking and hanging, jobs preempting resources arbitrarily, and emergency jobs failing to execute promptly. To address end-to-end data quality monitoring, the thesis designs a data flow monitoring mechanism for the big data platform, improving the Spark platform's data O&M monitoring capability. The thesis implements both systems, Spark job scheduling and data O&M monitoring, and applies them in a national-level big data analysis environment with good results.
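The abstract does not reproduce the algorithm itself. As a rough sketch of how a dynamic priority might combine job waiting time, user-declared urgency, and fit against currently free cluster resources across multiple DAG jobs, consider the following Scala fragment. All names (SparkJob, PriorityScheduler) and the weights are illustrative assumptions, not the thesis's actual design.

    // Minimal sketch of dynamic-priority scheduling over multiple DAG jobs.
    // Names and weights are hypothetical, for illustration only.
    import scala.collection.mutable

    // A pending business job: a whole Spark DAG treated as one schedulable unit.
    case class SparkJob(
        id: String,
        submittedAtMs: Long,   // submission timestamp
        urgency: Int,          // user-declared urgency, e.g. 0 = normal, 10 = emergency
        requestedCores: Int,   // resources the job asks the cluster for
        dependsOn: Set[String] // ids of jobs whose output this job consumes
    )

    class PriorityScheduler(totalCores: Int,
                            wWait: Double = 0.4,
                            wUrgency: Double = 0.4,
                            wFit: Double = 0.2) {

      private val pending  = mutable.Map.empty[String, SparkJob]
      private val finished = mutable.Set.empty[String]
      private var freeCores = totalCores

      def submit(job: SparkJob): Unit = pending(job.id) = job

      def complete(job: SparkJob): Unit = {
        finished += job.id
        freeCores += job.requestedCores
      }

      // Dynamic priority: grows with waiting time, scales with urgency,
      // and prefers jobs that fit the currently free resources.
      private def priority(job: SparkJob, nowMs: Long): Double = {
        val waitMin = (nowMs - job.submittedAtMs) / 60000.0
        val fit = if (job.requestedCores <= freeCores) 1.0 else 0.0
        wWait * waitMin + wUrgency * job.urgency + wFit * fit
      }

      // Pick the next runnable job: dependencies satisfied, resources available,
      // highest dynamic priority. High urgency dominates the score, so emergency
      // jobs are not blocked behind long queues of ordinary jobs.
      def next(nowMs: Long): Option[SparkJob] = {
        val runnable = pending.values.filter { j =>
          j.dependsOn.subsetOf(finished) && j.requestedCores <= freeCores
        }
        val pick =
          if (runnable.isEmpty) None
          else Some(runnable.maxBy(priority(_, nowMs)))
        pick.foreach { j =>
          pending.remove(j.id)
          freeCores -= j.requestedCores
        }
        pick
      }
    }

Because the waiting-time term keeps growing, a low-urgency job cannot starve indefinitely; because urgency is weighted equally, a declared emergency job overtakes the queue as soon as its resources are available.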
Keywords/Search Tags: Spark, big data job scheduling, data quality monitoring, dynamic priority