
Key Techniques Research On Unified Programming Environment For Heterogeneous Parallel Systems

Posted on: 2014-08-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Q Xun
GTID: 1228330479979540
Subject: Computer Science and Technology
Abstract/Summary:
Providing a unified programming environment for heterogeneous parallel systems is an inevitable choice for improving software productivity. We divide heterogeneous systems into two types and discuss the key design issues of a unified programming environment for each.

The first type comprises heterogeneous parallel systems whose components can all be programmed in the same way. Because heterogeneous parallel programming is difficult, the OpenCL programming framework was proposed as a unified programming method for different types of computing devices. OpenCL is now widely supported by chip manufacturers such as Intel, AMD, NVIDIA, and Apple. Although users can program different computing devices with OpenCL, they still have to control each OpenCL device separately. Moreover, OpenCL lacks performance portability to a considerable degree. The first three research topics of this thesis address this type of system.

1. Research on presenting a single OpenCL device image on a heterogeneous system. To simplify heterogeneous parallel programming, we propose VHCD, a high-level OpenCL runtime that abstracts multiple OpenCL devices into a single one. Offline profiling is used to balance the load among heterogeneous devices. A programming primitive lets users define the buffer access pattern of kernels. With VHCD, programs designed for a single OpenCL device can run effectively on systems with multiple OpenCL devices after a few primitives are inserted. Virtual cache management in VHCD minimizes data transfer between devices. We evaluate VHCD with a set of OpenCL test programs, and the results show that it achieves high performance under different hardware configurations.

2. Research on automatic distributed shared OpenCL memory management. We propose distributed shared OpenCL memory (DSOM) to support software-managed shared memory among multiple OpenCL devices.
DSOM allocates shared buffers in system memory and treats on-device memory as virtual cache buffers. To support fine-grained shared buffer management, we design a kernel parser that automatically analyzes the buffer access range of each kernel. DSOM adopts a basic modified, shared, invalid (MSI) cache coherence protocol to keep the cache buffers coherent. We propose a novel update strategy, called adaptive update, that minimizes data transfer between devices and launches transfer operations as early as possible, so that data transfer can overlap with kernel execution. A large set of test programs is used to evaluate the applicability of the buffer access analysis in DSOM, and the results show that the method works in most cases. Eight programs are used for performance evaluation, and the results show that programs using DSOM perform close to hand-coded versions.

3. Research on the performance portability of OpenCL. Although OpenCL provides functional portability, its performance portability is poor. It is very hard to design an OpenCL runtime that runs efficiently on every hardware platform, because their architectures differ greatly. As a first step toward improving the performance portability of OpenCL, we propose NOCL, an OpenCL CPU runtime that can execute GPU-optimized OpenCL programs efficiently. Scheduling a large number of work-items on a CPU incurs very high overhead, and the many synchronizations needed to keep local memory coherent between work-items make it worse. Based on the observation that local arrays are always used as temporary variables, we try to replace all local array accesses with global array accesses. After dependence testing, NOCL can eliminate barriers in a kernel and serialize work-items aggressively. After serialization, NOCL executes the work-items as an ordinary triple-nested loop and optimizes that loop through vectorization and cache management.
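To make the idea of work-item serialization concrete, the following is a minimal sketch of the general technique (splitting a kernel at its barrier and running each region as a triple-nested loop over the work-items of one work-group). It is not NOCL's actual transformation: the kernel body, the function names `run_work_group` and `run_kernel`, and the use of a plain list in place of the local array are all assumptions made for illustration.

```python
def run_work_group(a, out, group_id, local_size):
    """Run one work-group of a hypothetical kernel as plain loops.

    Assumed per-work-item kernel body (illustrative only):
        tmp[lid] = 2 * a[gid]
        barrier()
        out[gid] = tmp[lid] + tmp[(lid + 1) % n]
    """
    lx, ly, lz = local_size
    n = lx * ly * lz
    base = group_id * n
    tmp = [0] * n  # ordinary array standing in for the kernel's local array

    # Region before the barrier, serialized as a triple-nested loop.
    for z in range(lz):
        for y in range(ly):
            for x in range(lx):
                lid = (z * ly + y) * lx + x
                tmp[lid] = 2 * a[base + lid]

    # The barrier becomes the boundary between two loops: every work-item
    # has finished the first region before any starts the second one.
    for z in range(lz):
        for y in range(ly):
            for x in range(lx):
                lid = (z * ly + y) * lx + x
                out[base + lid] = tmp[lid] + tmp[(lid + 1) % n]


def run_kernel(a, local_size):
    """Run every work-group in sequence and return the output buffer."""
    lx, ly, lz = local_size
    n = lx * ly * lz
    if len(a) % n != 0:
        raise ValueError("input length must be a multiple of the work-group size")
    out = [0] * len(a)
    for group_id in range(len(a) // n):
        run_work_group(a, out, group_id, local_size)
    return out
```

For example, `run_kernel(list(range(8)), (2, 2, 1))` runs two work-groups of four work-items each. No per-work-item scheduling or synchronization remains, which is the overhead that serialization is meant to remove.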
Our experiments show that NOCL achieves a significant performance improvement over the Intel OpenCL CPU runtime when executing GPU-optimized OpenCL programs.

The second type of system discussed in this thesis is one whose devices cannot all be programmed by the same method. Although OpenCL tries to unify the programming methods of different kinds of compute devices, some special devices, such as FPGAs, still cannot be programmed with OpenCL. Because hardware logic design is complex, it is currently hard to achieve high performance when programming an FPGA in a high-level language. Software-hardware cooperation is a major challenge in application development for reconfigurable computers. We discuss how the operating system can support application development for reconfigurable computers.

1. Research on implementing efficient inter-process communication based on the hardware process in BORPH. We port BORPH to a class of reconfigurable platforms that consist of an FPGA board and a CPU, and design BORPH-N, an extension of BORPH. BORPH-N redesigns the hardware process and simplifies hardware process development in two respects. First, the interface between software and hardware is a software API rather than a hardware module. Second, the functional logic can draw on a rich hardware library because a standard FPGA on-chip bus is used. BORPH-N provides high-performance inter-process communication support consisting of shared memory and semaphores. Experiments show that optimizations that exploit the independent execution of hardware processes make the communication mechanism very efficient.
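The combination of shared memory and semaphores mentioned above can be illustrated with a small sketch. It is only an analogy under stated assumptions: a Python thread stands in for the hardware process, a one-element list stands in for the shared memory region, and the names `exchange`, `request_ready`, and `reply_ready` are hypothetical and are not part of the BORPH-N API.

```python
import threading


def exchange(values):
    """Send each value to a simulated hardware process and collect the replies."""
    shared = [0]                             # single-slot shared buffer
    request_ready = threading.Semaphore(0)   # signalled when a request is written
    reply_ready = threading.Semaphore(0)     # signalled when the reply is written
    results = []

    def hardware_process():
        # Hypothetical stand-in for an FPGA hardware process: square the input.
        for _ in values:
            request_ready.acquire()          # wait for the software side
            shared[0] = shared[0] * shared[0]
            reply_ready.release()            # hand the buffer back

    worker = threading.Thread(target=hardware_process)
    worker.start()
    for v in values:
        shared[0] = v                        # write the request into shared memory
        request_ready.release()
        reply_ready.acquire()                # block until the reply is in place
        results.append(shared[0])
    worker.join()
    return results
```

The two semaphores pass ownership of the buffer back and forth, so neither side reads the shared slot while the other is writing it.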
Keywords/Search Tags: Heterogeneous parallel programming, OpenCL, Distributed shared memory, Reconfigurable computing, Software-hardware cooperation