Design And Implementation Of A GPU Training Platform Based On YARN

Posted on: 2021-03-24  Degree: Master  Type: Thesis
Country: China  Candidate: C Su  Full Text: PDF
GTID: 2428330614971460  Subject: Software engineering
Abstract/Summary:
Today, big data gives enterprises accurate analysis and production guidance, but it also puts enormous pressure on storage and computing. Big data computing emerged to meet the need for massive data processing. It covers a variety of computing modes, and the platforms for each mode have their own solutions for cluster task scheduling and resource management. As a general-purpose task scheduling and resource management platform, YARN can host different computing modes, so that a cluster's physical resources are managed on a unified platform.

The biggest advantage of the GPU over the CPU is high-performance parallel computing. At present, however, distributed resource management platforms represented by YARN still have limitations. On the one hand, although YARN provides a complete access solution for third-party platforms, the platforms that use it are still mostly related to big data, and support for machine learning platforms is lacking. On the other hand, YARN supports only CPU and memory resource management, not GPU resource management, and it lacks a resource scheduling strategy for heterogeneous resource environments.

To strengthen GPU support for parallel machine learning computation, this thesis designs and implements a GPU training platform based on YARN. It improves YARN's underlying resource model by adding GPU resources to it, and completes YARN's management and scheduling support for GPU resources. The result is a distributed machine learning training platform that supports GPU scheduling. The work of this thesis is divided into the following four parts:

(1) Compare and analyze existing distributed scheduling architectures and machine learning platforms, and select the technology and architecture for the platform in this thesis.
(2) Design and implement a YARN-based distributed training platform according to the selection results. This includes the design and implementation of the YARN-based distributed platform architecture, the application abstraction and protocol, and the application execution process.
(3) On top of the distributed training platform, implement YARN's GPU resource management and scheduling mechanism, and complete the design of a GPU-FIFO scheduler.
(4) Test the platform on two performance indexes, task submission waiting time and task running time. The tests compare the scheduling performance of the GPU-FIFO scheduler with that of the original FIFO scheduler, and measure the running-time loss of an MPI distributed task run on YARN relative to the same task run directly in the physical machine environment.

The test results show that the YARN-based GPU training platform supports running MPI distributed training tasks on YARN without affecting YARN's original scheduling performance or task execution speed. Compared with the original FIFO scheduler, the GPU-FIFO scheduler clearly improves training efficiency, which meets the expected requirements.
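To make the idea of adding GPUs to the resource model concrete, the following is a minimal Python sketch, not the thesis's implementation. The abstract does not describe the internals of the GPU-FIFO scheduler, so everything here is an assumption for illustration: the `Resource` and `GpuFifoScheduler` names, the three-field resource vector, and the strict first-in-first-out placement rule are all hypothetical, and none of this is YARN's actual API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource vector: memory and CPU cores, extended with a GPU count."""
    memory_mb: int
    vcores: int
    gpus: int = 0

    def fits_in(self, available: "Resource") -> bool:
        # A request fits only if every dimension, including GPUs, is satisfied.
        return (self.memory_mb <= available.memory_mb
                and self.vcores <= available.vcores
                and self.gpus <= available.gpus)

    def minus(self, used: "Resource") -> "Resource":
        return Resource(self.memory_mb - used.memory_mb,
                        self.vcores - used.vcores,
                        self.gpus - used.gpus)


class GpuFifoScheduler:
    """Illustrative FIFO scheduler over GPU-aware resources (assumed design)."""

    def __init__(self, nodes: dict):
        self.free = dict(nodes)   # node name -> free Resource
        self.queue = deque()      # (task_id, Resource) in arrival order

    def submit(self, task_id: str, request: Resource) -> None:
        self.queue.append((task_id, request))

    def schedule(self) -> list:
        """Place tasks in arrival order; stop at the first task that fits nowhere."""
        placements = []
        while self.queue:
            task_id, request = self.queue[0]
            node = next((name for name, free in self.free.items()
                         if request.fits_in(free)), None)
            if node is None:
                break  # strict FIFO: the head of the queue blocks later tasks
            self.free[node] = self.free[node].minus(request)
            self.queue.popleft()
            placements.append((task_id, node))
        return placements
```

In this sketch a request that asks for GPUs can only land on a node whose free vector still has enough of them, which is the behavior a scheduler limited to CPU and memory cannot express.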
Keywords/Search Tags:YARN, Distributed Computing, GPU, Resource Management