
Research on Interference-Aware GPU Resource Provisioning for Predictable DNN Inference

Posted on: 2022-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: J N Xu
Full Text: PDF
GTID: 2518306776492894
Subject: Automation Technology
Abstract/Summary:
With the burgeoning demand for latency-sensitive artificial intelligence (AI) computation, GPUs are essential for accelerating deep neural network (DNN) inference workloads in cloud datacenters. The traditional exclusive and temporal sharing of GPUs for executing DNN inference workloads intrinsically wastes GPU resources. To fully utilize GPUs, spatial sharing among co-located DNN inference workloads has become increasingly compelling. Motivated by our empirical measurement study of DNN inference on Amazon EC2 GPU instances, we find that the performance interference among co-located inference workloads is noticeable. Through an in-depth analysis of the motivating experiments, we further identify the root causes of such interference: severe contention for the GPU scheduler and GPU L2 cache space, as well as GPU power consumption.

Existing works on guaranteeing performance Service Level Objectives (SLOs) of DNN inference focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration; how to proactively mitigate such severe performance interference has received comparatively little attention. To fill this gap, this thesis proposes iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. Specifically, iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages practically accessible system and workload metrics to explicitly capture the performance interference under different batch sizes and GPU resource allocations, and can accurately predict DNN inference performance; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes GPU resource allocation and adaptive batching based on the inference performance model, with the aim of achieving predictable performance for DNN inference workloads.

We implement a prototype of iGniter on top of the NVIDIA Triton Inference Server on Amazon EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee DNN inference performance SLOs while reducing monetary cost by up to 25% compared with state-of-the-art GPU resource provisioning strategies, with acceptable runtime overhead.
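To make the interference-aware performance model concrete, below is a minimal Python sketch; it is not iGniter's actual model or API, and every name, coefficient, and formula here is an illustrative assumption. It treats predicted latency as a profiled solo-run latency plus additive penalties for the two contention sources the thesis identifies (GPU scheduler contention and L2 cache overflow), and includes a helper for checking a latency SLO, mirroring how a provisioning strategy could search over GPU shares and batch sizes.

from dataclasses import dataclass

@dataclass
class Workload:
    batch_size: int          # batch size chosen by adaptive batching
    gpu_share: float         # fraction of GPU compute allocated, in (0, 1]
    solo_latency_ms: float   # latency profiled alone at this share and batch
    l2_footprint_mb: float   # profiled L2 cache demand

def predict_latency(target: Workload, colocated: list[Workload],
                    l2_capacity_mb: float = 40.0,
                    sched_coef: float = 0.05,
                    cache_coef: float = 0.3) -> float:
    # Hypothetical additive model: solo latency, plus a scheduler-contention
    # term that grows with the number of co-located workloads, plus an
    # L2-cache term that kicks in once aggregate demand exceeds capacity.
    sched_penalty = sched_coef * len(colocated) * target.solo_latency_ms
    total_l2 = target.l2_footprint_mb + sum(w.l2_footprint_mb for w in colocated)
    overflow = max(0.0, total_l2 - l2_capacity_mb) / l2_capacity_mb
    cache_penalty = cache_coef * overflow * target.solo_latency_ms
    return target.solo_latency_ms + sched_penalty + cache_penalty

def meets_slo(target: Workload, colocated: list[Workload], slo_ms: float) -> bool:
    # A provisioning strategy could call this while searching for the
    # cheapest (gpu_share, batch_size) placement that satisfies the SLO.
    return predict_latency(target, colocated) <= slo_ms

# Toy example with made-up numbers:
w1 = Workload(batch_size=8, gpu_share=0.5, solo_latency_ms=12.0, l2_footprint_mb=24.0)
w2 = Workload(batch_size=4, gpu_share=0.5, solo_latency_ms=9.0, l2_footprint_mb=30.0)
print(predict_latency(w1, [w2]))  # 13.86 ms under these toy coefficients

The point of such a model is that it is cheap to evaluate, so the provisioning strategy can enumerate candidate placements and batch sizes and keep only the lowest-cost configuration whose predicted latency stays within the SLO, rather than reacting to violations after they occur.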
Keywords/Search Tags:GPU resource provisioning, cloud-based DNN inference, predictable performance, performance interference