Design And Implementation Of High Available Virtualized GPU Resources

Posted on: 2017-08-02
Degree: Master
Type: Thesis
Country: China
Candidate: X H Xu
GTID: 2428330590488881
Subject: Software engineering
Abstract/Summary:
The Graphics Processing Unit (GPU) has cemented its position in modern computer systems. Its application scenarios range from graphics computing and media transcoding to high performance computing. Thanks to its parallel nature, the GPU also shows enormous potential for general-purpose computing on graphics processing units (GPGPU). Cloud environments have therefore begun to adopt the GPU as a key computing resource in order to provide hybrid computing services. To this end, two full GPU virtualization solutions, gVirt and GPUvm, have been proposed recently. These solutions, however, are still immature. For example, they lack some crucial Virtual Machine (VM) management functions, such as checkpointing and migration.

Additionally, a GPU may crash or hang for various reasons. The current remedy is to reset the GPU hardware through the mechanism provided by modern GPU vendors. When the GPU driver detects a timeout, the operating system can usually recover by resetting the GPU hardware in the driver, but the application may end up with an unpredictable result. This approach sacrifices the execution of applications for the stability of the operating system. A typical cloud environment uses virtualization to consolidate multiple VMs on one physical host. However, virtualization has always been a double-edged sword: the benefits of consolidation come at the price of an increased possibility of GPU failure. As a result, the demand for High Availability (HA) in virtualization becomes more pressing.

In this paper, we pioneer a fast, iterative checkpointing mechanism for VMs with full GPU virtualization, based on gVirt. The challenges are how to define the context of a virtual GPU and how to reduce the downtime. We are the first to propose command auditing to solve the problem that the GPU lacks a dirty-bit mechanism. Further, we propose HAG, an open source solution that uses our checkpointing mechanism to back up the whole VM to another host. The backup VM can then take over when the driver detects a GPU hang, which guarantees the high availability of virtualized GPU resources. Since continuous checkpointing via HAG incurs overhead, we analyze the downtime and the performance degradation. Our evaluation shows that 1) the downtime of VM migration or backup is only 224-411 ms, and 2) different GPU workloads achieve 77%-92% of their performance with a backup interval of several seconds, while our solution occupies only 80-170 Mbit/s of bandwidth during execution.
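
The command auditing idea described above can be illustrated with a minimal sketch. It is not the HAG or gVirt implementation: the `GpuCommand` structure, the 4 KiB page size, and the dictionary-based memory model are all assumptions made for illustration. The sketch only shows the general principle that, without hardware dirty bits, the pages a GPU may have modified can be inferred by inspecting submitted commands, so that each checkpoint round copies only those pages.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page granularity, not taken from the thesis


@dataclass
class GpuCommand:
    """Hypothetical command that writes `length` bytes starting at `addr`."""
    addr: int
    length: int


@dataclass
class CommandAuditor:
    """Tracks pages that audited commands may have written."""
    dirty_pages: set = field(default_factory=set)

    def audit(self, cmd: GpuCommand) -> None:
        # Mark every page overlapped by the command's write range as dirty.
        if cmd.length <= 0:
            return
        first = cmd.addr // PAGE_SIZE
        last = (cmd.addr + cmd.length - 1) // PAGE_SIZE
        self.dirty_pages.update(range(first, last + 1))

    def take_dirty(self) -> set:
        # Return the current dirty set and start a fresh one for the next round.
        pages, self.dirty_pages = self.dirty_pages, set()
        return pages


def checkpoint(auditor: CommandAuditor, memory: dict, backup: dict) -> int:
    """Copy only pages dirtied since the last checkpoint; return how many."""
    pages = auditor.take_dirty()
    for page in pages:
        backup[page] = memory.get(page, b"")
    return len(pages)
```

In this toy model, a write of 200 bytes at address 4000 straddles pages 0 and 1, so the next call to `checkpoint` copies exactly those two pages and an immediately following call copies none.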
Keywords/Search Tags: virtualization, GPU, checkpointing, migration, high availability