
The Design And Implementation Of The GPU Resource Management Component In Transwarp Container Platform

Posted on: 2022-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: C M Gong
Full Text: PDF
GTID: 2518306725984029
Subject: Master of Engineering
Abstract:
As container cloud technology, represented by Docker and Kubernetes, has matured, many enterprises have begun to containerize their applications and manage them with Kubernetes, AI applications included. In current container cloud technology, however, the virtualization and management of GPU resources are still at an early stage. AI applications, especially deep learning applications, rely on GPUs to accelerate computation, which makes deploying them on Kubernetes challenging.

Kubernetes currently manages GPU resources through a device plugin provided by NVIDIA, which lets workloads deployed in Kubernetes use GPUs. This approach, however, only supports exclusive use of a physical GPU device by a single container, which causes two problems: (1) GPU resources are underutilized by tasks that consume few of them, such as model inference; (2) average task latency rises under high concurrency when a customer's GPU devices are limited in number.

To solve these problems, this thesis proposes integrating GPU sharing into Kubernetes so that multiple workloads can share a physical GPU, thereby increasing GPU utilization and reducing average task latency. The thesis presents a solution called Krux, which leverages the extensibility mechanisms offered by Kubernetes to define how GPUs are used and shared in Kubernetes. Based on this solution, a GPU resource management component is implemented, consisting of four functional modules: the GPU device plugin module, the GPU scheduling plugin module, the container runtime module, and the resource limitation module.

The GPU device plugin module is built on the Kubernetes device plugin mechanism; it monitors the GPU devices on each worker node and reports the virtualized GPU resources to the Kubernetes cluster. The GPU scheduling plugin module is built on the Kubernetes scheduling framework and works with the default scheduler to make scheduling decisions for GPU workloads. The resource limitation module intercepts the CUDA driver API and adds control logic that limits the GPU resources a CUDA application may use. The container runtime module makes customized modifications to the official NVIDIA container runtime and acts as a bridge between the two modules upstream in the container-creation path and the resource limitation module.

The GPU resource management component has been integrated into version 3.0 of Transwarp's container platform. The big data and AI products supported by the platform have begun to use GPU sharing widely in the latest development version, effectively improving GPU utilization and reducing average task latency, which helps customers cut hardware costs.
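To make the usage model concrete, the sketch below shows how a workload would request a slice of a shared GPU through the standard Kubernetes extended-resource interface. The resource name "transwarp.io/vgpu" and the container image are placeholders assumed for illustration; the thesis does not state the names Krux actually registers.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sharedGPUPod builds a Pod that asks for one virtual-GPU slice instead
// of a whole physical device. "transwarp.io/vgpu" is an assumed
// placeholder for whatever extended resource the device plugin registers.
func sharedGPUPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "inference-server"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "model",
				Image: "example.com/inference:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					// Extended resources must appear in Limits; the kubelet
					// forwards the grant to the matching device plugin.
					Limits: corev1.ResourceList{
						"transwarp.io/vgpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
}

func main() {
	fmt.Println(sharedGPUPod().Name)
}
```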
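The device plugin module described above follows the Kubernetes device plugin gRPC contract: ListAndWatch advertises devices to the kubelet, and Allocate is called when a container is granted some of them. A minimal Go sketch of advertising each physical GPU as several virtual slices follows; the slicing scheme, device ID format, and environment-variable handoff are assumptions for illustration rather than Krux's actual design, and registration with the kubelet is omitted.

```go
package vgpuplugin

import (
	"context"
	"fmt"
	"strings"
	"time"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// vgpuPlugin advertises each physical GPU as several virtual "slices".
type vgpuPlugin struct {
	physicalGPUs int // discovered via NVML in a real plugin
	slicesPerGPU int // how many containers may share one GPU
}

// ListAndWatch streams the virtual device list to the kubelet.
func (p *vgpuPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	var devs []*pluginapi.Device
	for g := 0; g < p.physicalGPUs; g++ {
		for i := 0; i < p.slicesPerGPU; i++ {
			devs = append(devs, &pluginapi.Device{
				ID:     fmt.Sprintf("gpu%d-slice%d", g, i),
				Health: pluginapi.Healthy,
			})
		}
	}
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
		return err
	}
	// A real plugin re-sends the list on health changes; here we idle.
	for {
		time.Sleep(10 * time.Second)
	}
}

// Allocate maps the granted virtual slices back to physical GPU indices
// and passes them to the container through environment variables.
func (p *vgpuPlugin) Allocate(_ context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, cr := range req.ContainerRequests {
		gpus := map[string]bool{}
		for _, id := range cr.DevicesIDs {
			gpus[strings.SplitN(id, "-", 2)[0]] = true // "gpu0-slice3" -> "gpu0"
		}
		var visible []string
		for g := range gpus {
			visible = append(visible, strings.TrimPrefix(g, "gpu"))
		}
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				// Read by the NVIDIA container runtime to expose the GPU.
				"NVIDIA_VISIBLE_DEVICES": strings.Join(visible, ","),
			},
		})
	}
	return resp, nil
}

// Remaining DevicePluginServer methods are no-ops in this sketch.
func (p *vgpuPlugin) GetDevicePluginOptions(context.Context, *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	return &pluginapi.DevicePluginOptions{}, nil
}
func (p *vgpuPlugin) GetPreferredAllocation(context.Context, *pluginapi.GetPreferredAllocationRequest) (*pluginapi.GetPreferredAllocationResponse, error) {
	return &pluginapi.GetPreferredAllocationResponse{}, nil
}
func (p *vgpuPlugin) PreStartContainer(context.Context, *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
	return &pluginapi.PreStartContainerResponse{}, nil
}
```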
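The scheduling plugin module plugs into the Kubernetes scheduling framework, which exposes extension points such as Filter and Score alongside the default scheduler. The sketch below implements only the Filter point, rejecting nodes that lack free virtual-GPU capacity; the actual Krux plugin presumably reasons about finer-grained quantities such as per-GPU memory, which this sketch does not attempt, and the resource name is again an assumed placeholder.

```go
package vgpusched

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const vgpuResource = "transwarp.io/vgpu" // assumed resource name

// VGPUFilter rejects nodes without enough free virtual-GPU slices.
type VGPUFilter struct{}

var _ framework.FilterPlugin = &VGPUFilter{}

func (f *VGPUFilter) Name() string { return "VGPUFilter" }

func (f *VGPUFilter) Filter(_ context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	requested := podVGPURequest(pod)
	if requested == 0 {
		return framework.NewStatus(framework.Success)
	}
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	allocatable := node.Status.Allocatable[vgpuResource]
	used := int64(0)
	for _, pi := range nodeInfo.Pods {
		used += podVGPURequest(pi.Pod)
	}
	if used+requested > allocatable.Value() {
		return framework.NewStatus(framework.Unschedulable, "not enough vGPU slices")
	}
	return framework.NewStatus(framework.Success)
}

// podVGPURequest sums the virtual-GPU limits across a pod's containers.
func podVGPURequest(pod *v1.Pod) int64 {
	total := int64(0)
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[vgpuResource]; ok {
			total += q.Value()
		}
	}
	return total
}
```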
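The resource limitation module works by intercepting CUDA driver API calls such as cuMemAlloc and cuMemFree. The interception layer itself has to be a native library injected into the container (for example via LD_PRELOAD), so the Go sketch below shows only the quota bookkeeping such a shim would perform; all names here are assumptions for illustration, not the thesis's implementation.

```go
package gpulimit

import (
	"errors"
	"sync"
)

// quotaTracker mirrors the bookkeeping a CUDA-interception shim performs:
// before forwarding cuMemAlloc to the real driver it checks the
// container's memory quota, and on cuMemFree it returns bytes to the pool.
type quotaTracker struct {
	mu    sync.Mutex
	limit uint64 // bytes this container may allocate, set at startup
	used  uint64 // bytes currently allocated
}

var errQuotaExceeded = errors.New("vGPU memory quota exceeded")

// reserve is consulted on every intercepted cuMemAlloc. A real shim
// translates this error into CUDA_ERROR_OUT_OF_MEMORY, so the application
// sees an ordinary allocation failure.
func (t *quotaTracker) reserve(bytes uint64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.used+bytes > t.limit {
		return errQuotaExceeded
	}
	t.used += bytes
	return nil
}

// release is consulted on every intercepted cuMemFree.
func (t *quotaTracker) release(bytes uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if bytes > t.used {
		bytes = t.used
	}
	t.used -= bytes
}
```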
Keywords: Cloud Computing, Deep Learning, Container, Kubernetes, GPU Sharing