
The Research Of Virtualization On General-purpose Computation On Graphic Processing Unit

Posted on: 2013-01-19    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L Shi    Full Text: PDF
GTID: 1228330395985106    Subject: Computer application technology
Abstract/Summary:
The system virtual machine is an important research topic in virtualization and the fundamental infrastructure of cloud computing. System virtual machine technology has successfully virtualized many I/O devices, but the GPU (Graphics Processing Unit) remains an exception. In particular, the general-purpose processing ability of the GPU (GPGPU) has never been fully virtualized on a system virtual machine platform. In practice, academia and the VMM industry choose to realize GPU virtualization at a higher layer: the Application Programming Interface (API). Some preliminary results focusing on traditional graphics APIs have been published. CUDA (Compute Unified Device Architecture) is a new API designed directly for GPGPU; it provides the ability to manipulate the GPU hardware without the help of a graphics API. The rise of CUDA shows that virtualizing the graphics API is not enough for GPGPU workloads built on a dedicated API framework: existing graphics API virtualization has no effect on CUDA applications. An independent GPU API framework therefore calls for an independent virtualization method.

To improve the usability of GPGPU in virtualized environments, this paper describes vCUDA, a general-purpose graphics processing unit (GPGPU) computing solution for virtual machines (VMs). vCUDA allows applications executing within VMs to leverage hardware acceleration, which can benefit the performance of a class of high-performance computing (HPC) applications. The key insights in our design include API call interception/redirection, lazy RPC, a dedicated RPC system for VMs, and support for advanced VMM features. With API interception and redirection, Compute Unified Device Architecture (CUDA) applications in VMs can access a graphics hardware device and achieve high computing performance transparently, without modification of the application or the operating system.
Evaluation with the official examples and third-party applications shows that vCUDA mimics the original CUDA protocol in a virtualized environment; all tests produce the same results as in the native environment.

Thousands of CUDA API calls may be issued by a single CUDA application. If vCUDA sent every API call to the remote site the moment it was intercepted, the same number of RPCs would be invoked and the overhead of excessive world switches would inevitably be introduced into the vCUDA system. vCUDA borrows an idea from graphics API virtualization and adopts an optimization mechanism called Lazy RPC, which improves system performance by intelligently batching specific API calls. The related experiments show that Lazy RPC reduces the number of remote calls to 30% and speeds up vCUDA performance to 148%.

In the current study, vCUDA achieves near-native performance with a dedicated RPC system, VMRPC. VMRPC is a lightweight RPC framework specifically designed for VMs that leverages heap and stack sharing to circumvent unnecessary data copying and serialization/deserialization, achieving high performance. Our evaluation shows that RPC throughput improves by two orders of magnitude. We carried out a detailed analysis of the performance of our framework (vCUDA+VMRPC). Using a number of unmodified official examples from the CUDA SDK and third-party applications in the evaluation, we observed that CUDA applications running with vCUDA exhibit a very low performance penalty (less than 21%) in comparison with the native environment, demonstrating the viability of the vCUDA architecture.

vCUDA exposes device multiplexing and suspend/resume functions on the basis of CUDA virtualization; any CUDA application built on top of these features can run as usual in virtual machines without any modification. vCUDA develops a one-to-many model to multiplex the GPU device among VMs.
Under the coordination of the vCUDA stub, two different service threads can cooperatively manipulate one hardware resource by connecting to a single working thread. Suspend/resume is realized by storing and restoring the CUDA state while no kernel is running. The device multiplexing and suspend/resume tests show that the performance degradation introduced by vCUDA is trivial. Based on the CUDA state tracking technology of vCUDA, we realize an inter-kernel checkpoint scheme on the GPU.
Keywords/Search Tags:GPGPU, virtualization, RPC, Xen, KVM, VMware, CUDA