Research And Optimization Of YARN-Based Hybrid Structure Scheduler

Posted on: 2019-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Zhang
Full Text: PDF
GTID: 2428330566496847
Subject: Computer technology
Abstract/Summary:
Big data technology has rapidly penetrated all walks of life and has produced a variety of data processing needs: real-time event processing, batch processing, machine learning, and many others. In response, Hadoop introduced YARN for resource management in its second generation (Hadoop 2.0). YARN is Hadoop's answer to these multi-dimensional requirements, transforming Hadoop from a single-purpose "bulk storage/processing" system into a true multipurpose platform. However, by analyzing such heterogeneous workloads in a production environment, we find that the cluster's tasks share resources disproportionately: a small number of long tasks consume most of the cluster's resources. We also find that such workloads produce a large number of resource fragments, that is, resources that have been allocated but not yet used. This thesis therefore extends YARN with a hybrid-structure scheduling method, in which long and short tasks are handled separately and a distributed scheduler makes use of the resource fragments.

This thesis studies a hybrid-structure scheduling method based on YARN, that is, adding a distributed scheduler to the original scheduling system. The two kinds of scheduler have different characteristics: 1) the central scheduler can provide strict scheduling invariants (such as fairness and capacity) for heterogeneous applications; 2) the distributed scheduler can provide scalable and efficient scheduling, but it is difficult for it to enforce scheduling invariants. We use the central scheduler to schedule long tasks because it has a global view of resources and can optimize task placement along multiple dimensions. For short tasks, we use the distributed scheduler to take advantage of resource fragments through over-allocation.

After the distributed scheduler is introduced, there are two scheduling paths in the scheduling system. The first problem we face is how to separate long and short tasks. We place this work in the application framework, because an application best understands its own resource needs. Without loss of generality, this thesis implements this function in the MapReduce framework through sampling execution and regression analysis. In addition, because the distributed scheduler makes use of over-allocated resources, a node is very likely to experience congestion. For this problem, we propose an active avoidance solution that trains a congestion-avoidance model by learning from job history and uses this model to guide the scheduler to abandon decisions that are likely to cause congestion.

Finally, through comparison experiments using a variety of workloads, including reproduced real production workloads, typical benchmark workloads, and mixed workloads, we verify the short-task selection module, the congestion avoidance module, and the overall performance improvement. The experiments show that the hybrid-structure scheduler improves the task throughput of the cluster, thereby improving resource utilization and shortening task completion time.
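
The abstract does not give the details of the long/short task separation, so the following is only a minimal sketch of the general idea of "sampling execution and regression analysis", not the thesis's implementation. It assumes that task duration grows roughly linearly with input size, that a few sampled runs are available as (input size, duration) pairs, and that a fixed duration cutoff separates short from long tasks. The names `fit_line`, `route_task`, and `SHORT_TASK_THRESHOLD_S`, as well as the cutoff value, are hypothetical.

```python
# Hypothetical cutoff (seconds) below which a task counts as "short".
# The thesis abstract does not state a value; this is an assumption.
SHORT_TASK_THRESHOLD_S = 10.0


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def route_task(input_mb, model, threshold=SHORT_TASK_THRESHOLD_S):
    """Pick a scheduling path from the predicted task duration.

    Predicted-short tasks go to the distributed scheduler (which can use
    resource fragments); predicted-long tasks go to the central scheduler.
    """
    a, b = model
    predicted_s = a + b * input_mb
    return "distributed" if predicted_s <= threshold else "central"


# Sampled executions: (input size in MB, measured duration in seconds).
sample_sizes = [10.0, 20.0, 30.0]
sample_durations = [1.0, 2.0, 3.0]
model = fit_line(sample_sizes, sample_durations)

small_task_path = route_task(50.0, model)    # predicted about 5 s
large_task_path = route_task(500.0, model)   # predicted about 50 s
```

In this sketch the decision is made on the application side before a resource request is issued, which matches the abstract's point that the application framework is best placed to judge its own tasks. A real predictor would likely use more features than input size alone.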
Keywords/Search Tags: Hadoop YARN, hybrid scheduler, over-allocation, congestion avoidance