Research on a Model Scheduling System Based on a Big Data Platform

Posted on: 2018-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: S J Peng
GTID: 2348330512983273
Subject: Computer application technology
Abstract/Summary:
In recent years, as data volumes and computing power have grown, more and more data processing tasks run on clusters. To improve programming flexibility and job execution efficiency, different job types built on MapReduce, such as Pig and Hive, have appeared, followed by Spark, which is based on in-memory computing. Users only need to submit their jobs to YARN (Yet Another Resource Negotiator), which manages and schedules all of them uniformly. In practice, however, complex data processing tasks often require multiple operations, or even a combination of different job types. Different job types have different execution characteristics, and some jobs are repetitive. As resources are allocated and reclaimed during execution, the resources available in the cluster change dynamically, and competition for resources among jobs significantly affects both execution efficiency and whether the scheduling goals are met. If jobs are simply submitted to YARN without considering their characteristics or the available software and hardware resources, they are more likely to hang or fail because of system overload, wasting resources. Plain submission also offers no means of intervention, so fine-grained job management and control is not possible.

In this thesis, we first define the characteristic parameters of the job model and the resource model used by the scheduling system, and propose a hybrid model based on them. To acquire and preprocess the scheduling parameters of the hybrid model, we propose a deep-learning-based method for completing missing data. Using workflow management technology, we predict the schedulability and execution time of a job from the amount of data it processes, the characteristics of the job itself, and the collected characteristics of the cluster resources. Jobs are then scheduled according to these predictions, which reduces the job failure rate and improves execution efficiency.

For a single job, the workflow management system collects the job's execution time and state. Combined with the resource metrics gathered by the resource monitor during execution, this yields historical information for that job. An SVM (support vector machine) algorithm then uses this history to predict future resource availability, and the prediction determines whether the job is schedulable. For the DAG (Directed Acyclic Graph) job model, which combines jobs of different types, the critical path is computed from the expected execution time of each job. While the job model runs, the scheduler uses workflow management technology to dynamically suspend or resume jobs in response to the model's running state and resource changes, thereby controlling the execution flow. Building on the critical path, we propose a scheduling discriminant function and use it to change the execution path dynamically, improving the efficiency of the job model as a whole.

Finally, we design a number of job model samples, build an experimental cluster environment, verify the validity of the prediction-based scheduling algorithm, and analyze the experimental results under various parameter settings. The results show that both the proposed scheduling algorithm and the scheduling system built on it perform well and achieve the expected design goals.
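
The abstract mentions computing the critical path of a DAG job model from expected job execution times but does not give the procedure. The following is a minimal sketch of one standard way to do this (longest path via a topological sort), not the thesis's own implementation. The function name `critical_path`, the job names, and the durations are hypothetical, and the scheduling discriminant function is not reproduced because the abstract does not define it.

```python
from collections import deque


def critical_path(durations, deps):
    """Return (total_time, path) for the longest path through a job DAG.

    durations: job name -> expected execution time (hypothetical values)
    deps:      job name -> list of jobs that must finish before it starts
    """
    if not durations:
        return 0, []

    # Build successor lists and in-degree counts for a topological sort.
    successors = {job: [] for job in durations}
    indegree = {job: 0 for job in durations}
    for job, parents in deps.items():
        for parent in parents:
            successors[parent].append(job)
            indegree[job] += 1

    # finish[job]: earliest finish time; prev[job]: predecessor that sets it.
    finish, prev = {}, {}
    ready = deque(job for job in durations if indegree[job] == 0)
    for job in ready:
        finish[job] = durations[job]
        prev[job] = None

    visited = 0
    while ready:
        job = ready.popleft()
        visited += 1
        for nxt in successors[job]:
            candidate = finish[job] + durations[nxt]
            if candidate > finish.get(nxt, float("-inf")):
                finish[nxt] = candidate
                prev[nxt] = job
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if visited != len(durations):
        raise ValueError("job graph contains a cycle")

    # Walk back from the job that finishes last to recover the path.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return total, path


# Hypothetical mixed-type job model: two inputs feed a Spark job, then a report.
durations = {"hive_etl": 30, "pig_clean": 20, "spark_train": 45, "mr_report": 10}
deps = {"spark_train": ["hive_etl", "pig_clean"], "mr_report": ["spark_train"]}
total, path = critical_path(durations, deps)
```

In this example the longer of the two input jobs determines when the Spark job can start, so a scheduler of the kind described would have more freedom to delay or suspend `pig_clean` than `hive_etl` without lengthening the overall run.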
Keywords/Search Tags: Hybrid job model, Scheduling algorithm, Workflow, SVM