Job-aware Network Scheduling For Hadoop Cluster

Posted on:2017-02-26

Degree:Master

Type:Thesis

Country:China

Candidate:Z G Wang

Full Text:PDF

GTID:2348330488959846

Subject:Computer application technology

Abstract/Summary:

With the recent explosion in data growth, the data center has become core infrastructure for big data processing. To analyze data quickly and efficiently and extract valuable information, many distributed frameworks have been proposed, such as Hadoop and Dryad. These frameworks split big data across clusters of hundreds or thousands of computers, analyze each piece in parallel, then transfer the output of each piece and merge it into the final result. Improving cluster utilization and keeping the duration of data processing jobs low are common goals for data centers.

In a data center, big data frameworks like Hadoop transfer a large amount of data between computation stages, and this transfer has become a key bottleneck for application performance. Optimizing the scheduling of flows can improve big data job performance. Traditional techniques mostly schedule individual flows without considering the correlations among them. In this paper, we use Hadoop as a concrete example: we obtain flow information during the Hadoop shuffle from the application layer and propose job-aware priority scheduling for data flows based on their characteristics.

First, we observe that rich traffic demand information exists in the intermediate files and log files. This observation motivates us to obtain traffic forecasts from the application layer. Such information and the flow co-dependencies can be extracted through run-time file system monitoring and file analysis.

Then we propose job-aware priority scheduling to optimize shuffle transfer with a global view. The key is to assign the same priority to all flows of the same job. Under the priority policy, high-priority flows are allocated network resources first, so that the flows of a job complete together as soon as possible and the job is not delayed by a single long-running flow. We allocate network resources to jobs from two perspectives: path management and buffer queue management at the switch.
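
The idea of giving every flow of a job the same priority can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Flow` fields, the `assign_job_priorities` helper, the number of priority levels, and in particular the rule for ranking jobs (smallest total predicted shuffle bytes first) are assumptions made here for illustration, since the abstract does not state how jobs are ordered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    job_id: str      # Hadoop job the shuffle flow belongs to
    src: str         # mapper host (hypothetical naming)
    dst: str         # reducer host (hypothetical naming)
    size_bytes: int  # predicted transfer size from application-layer information


def assign_job_priorities(flows, num_levels=8):
    """Return {flow: priority}, where 0 is the highest priority.

    All flows of one job receive the same level.
    Assumption: jobs are ranked by total predicted shuffle bytes, smallest first.
    """
    demand = defaultdict(int)
    for flow in flows:
        demand[flow.job_id] += flow.size_bytes

    # Sort by (bytes, job_id) so that ties are broken deterministically.
    ranked = sorted(demand, key=lambda job: (demand[job], job))

    # Jobs beyond the available levels share the lowest priority level.
    level = {job: min(rank, num_levels - 1) for rank, job in enumerate(ranked)}
    return {flow: level[flow.job_id] for flow in flows}
```

With two flows of `job_a` (500 and 700 bytes) and one flow of `job_b` (300 bytes), both `job_a` flows share one level and the `job_b` flow gets the higher priority under the assumed ranking rule.
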
In the Fat-Tree topology, we introduce flow-based and spray-based path selection to leverage equal-cost multipath for load balancing. We set up priority queues and propose a queue management scheme. When a packet arrives at a switch, it is enqueued according to its priority value.

Finally, we implement the proposed scheme in the NS-2 simulator and run comparative experiments. The results show that our scheduling reduces the average shuffle completion time, increases network utilization, and remarkably reduces the network transfer time for the job with the highest priority. Moreover, we simulate different settings such as background traffic, delayed scheduling commands, and link failure to make the experiments more realistic. These results also show that job-aware priority scheduling can optimize network transfer during the shuffle phase.
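
The path selection and switch queueing described above can be sketched in Python as follows. This is a hedged illustration rather than the NS-2 code from the thesis: the function and class names, the CRC32 hash, round-robin spraying, the per-queue capacity, and strict-priority dequeueing are all assumptions, because the abstract only states that packets are enqueued by priority and that flow-based and spray schemes are used over equal-cost paths.

```python
import zlib
from collections import deque
from itertools import count


def flow_based_path(flow_id: str, num_paths: int) -> int:
    """Flow-based choice: hash the flow identifier so all its packets use one path."""
    return zlib.crc32(flow_id.encode()) % num_paths


def make_spray_selector(num_paths: int):
    """Spray: spread successive packets over all equal-cost paths.

    Round-robin rotation is an assumption made for this sketch.
    """
    counter = count()
    return lambda: next(counter) % num_paths


class PrioritySwitchPort:
    """Output port with one FIFO queue per priority level; 0 is the highest."""

    def __init__(self, num_levels: int = 8, capacity: int = 64):
        self.queues = [deque() for _ in range(num_levels)]
        self.capacity = capacity  # per-queue packet limit (hypothetical value)

    def enqueue(self, packet: dict) -> bool:
        """Queue the packet by its priority value; return False if it is dropped."""
        # Clamp out-of-range priority values to the valid levels.
        level = min(max(packet["priority"], 0), len(self.queues) - 1)
        queue = self.queues[level]
        if len(queue) >= self.capacity:
            return False
        queue.append(packet)
        return True

    def dequeue(self):
        """Strict priority (assumed): serve the highest-priority non-empty queue."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None
```

In this sketch a packet from a low-priority job waits until every higher-priority queue is empty, which is one simple way to let the flows of the highest-priority job finish first.
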

Keywords/Search Tags:

Datacenter Network, Hadoop Shuffle, Priority Scheduling, Fat-Tree

Related items

1	Research On A Datacenter Network Traffic Scheduler Based On Dynamic Priority
2	A Research And Application Of Time-sharing Flow Scheduling For Datacenter Network
3	Research Of Optimization Of Hadoop MapReduce Shuffle Phase
4	Research And Optimization Of Job Scheduling Algorithm Based On Hadoop
5	Research Of Hadoop Job Scheduling Based On Priority And Reliability In Cloud Computing Environment
6	Research And Design Of Real-time Performance Of Job Scheduling Based On Hadoop Cluster
7	A Priority-based Scheduling Algorithm For Hadoop
8	Job Scheduling Algorithm Based On Hadoop Platform Optimization Research
9	Research On Traffic Scheduling In Datacenter Network
10	The Research Of Centralized Scheduling In WiMAX Mesh Network